Reading a bad sector does not report failure as 'read error' but hangs PC with 'Machine Check Exception'

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Last night I discovered a problem in my RAID5 array
and finally after a lot of tests I narrowed it down to
a bad sector on one of the hard disks and some goofy
kernels.

I just yesterday build a new PC using an existing
array of 5 disks in RAID 5. I did build the array with
only 4 out of 5 disks in the system but the rebuild
processes stopped over and over again apparently at
the same position. At last I found out that the
harddisk at the first SATA port had developed some bad
sectors which made the kernel stop completely when it
tried to read that sector with the following error on
the screen:

HARDWARE ERROR
CPU 0: Machine Check Exception: 4  Bank 4:
b200000000070f0f
TSC b7d4a144d0
This is not a software problem!
Run through mcelog --ascii to decode and contact your
hardware vendor
Kernel panic - not syncing: Machine check

Googling around made me check memory, upgrade the BIOS
and things like that but now i DO think that this IS a
software problem, which is in the linux kernel.

I was running the standard 2.6.20-16 kernel series
from Ubuntu Feisty Fawn (using the generic and server
built) and I built my own 2.6.22.1 but the problem
still persisted. When copying manually with dd_rescue
I was not able to copy past the bad sector or the MCE
error reappeared. Only when using the standard Ubuntu
Edgy Eft kernel (2.6.17-12-server) the problem went
away completely and the syslog was filled with normal
lines like: 

Jul 28 22:58:26 mediaserver kernel: [ 6562.446868]
ata2: error=0x40 { UncorrectableError }
Jul 28 22:58:26 mediaserver kernel: [ 6562.446875] sd
1:0:0:0: SCSI error: return code = 0x8000002
Jul 28 22:58:26 mediaserver kernel: [ 6562.446880]    
Additional sense: Unrecovered read error - auto
reallocate failed
Jul 28 22:58:26 mediaserver kernel: [ 6562.446887]
end_request: I/O error, dev sda, sector 205534870

So in the end I was able to copy my stuff off the bad
harddisk to a new disk (losing some bytes because of
my already dirty RAID5 array) but I do think this is a
kernel bug or at least strange behavior as an old
kernel is willing to continue operation on something
'minor' as a bad sector. In the end when I will start
scrubbing the drive array overnight a simple bad
sector on the array will take down the complete system
instead of just continuing with 1 faulty drive in the
array!

Some information about the hardware:
AMD Athlon 64 3000+
Asus A8N-E Deluxe motherboard
1 GB RAM
4 Seagate 7200.9 drives on the NVIDIA SATA controller
(sda ... sdd)
2 WD drives on the IDE controller (hda, hdc)
Running Feisty Fawn 64 bit Server edition

Faulty drive is /dev/sda and on thus on the first SATA
port. Changing this to a different port on the
motherboard gives the same lockup. There is also a SIL
3114 controller on the motherboard but I have not
tried to dd_rescue with the faulty drive on that
controller to see if it locks up the kernel.

Regards,
Hendrik van den Boogaard




      ____________________________________________________________________________________
Luggage? GPS? Comic books? 
Check out fitting gifts for grads at Yahoo! Search
http://search.yahoo.com/search?fr=oni_on_mail&p=graduation+gifts&cs=bz
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux