Re: System / libata IDE controller woes (long)

I had the same problem you did when I put 3 identical controllers 
together.  To get around it I used 2 TX133s and 1 TX100x2.  I 
believe the identical cards are the root cause of your problems.

Justin.

On Tue, 26 Dec 2006, Erik Ohrnberger wrote:

> First off, Merry Christmas, Seasons Greetings and Happy Holidays!
> 
> Hang on, this is a bit of a long story, but I think that you'll need the
> information and background.
> 
> I want what amounts to a NAS, that I'd like to build on gentoo Linux.  I'm
> familiar with gentoo and the use of EVMS, so I think I'm pretty well
> prepared from this perspective.
> 
> Earlier this year, when I started putting it together, I gathered my
> hardware.  A decent 2 GHz Athlon system with 512 MB RAM, DVD drive, a 40 GB
> system drive, and a 500 Watt power supply.  Then I started adding hard
> disks.  To date, I've got 5 80 GB PATA, 2 200 GB PATA, and 1 60 GB PATA.
> 
> I mounted the drives on a set of aluminum rails that I had a friend make for
> me.  They run vertically, and have slots through which screws are tightened
> into the normal hard drive's mounting holes.  All the communication cables
> are 80-conductor cables, and run pretty much straight to the controller cards,
> while the power pigtails fan out on the side of the 'tower'.
> 
> With all these hard drives, I also got 3 Promise 20269 IDE controllers.
> I put it all together and created 2 logical volumes: one linked EVMS LV,
> and one RAID5 across the 5 80 GB drives.  To support this configuration, I
> connected the drives in the following manner (using /dev/hdX notation):
> 
> ide0:	/dev/hda = System boot disk	(Motherboard)
> 	/dev/hdb = DVD ROM
> ide1:	/dev/hdc = nothing
> 	/dev/hdd = nothing
> ide2:	/dev/hde = raid disk		(First Promise card)
> 	/dev/hdf = lvm disk
> ide3:	/dev/hdg = raid disk
> 	/dev/hdh = lvm disk
> ide4:	/dev/hdi = raid disk		(Second Promise card)
> 	/dev/hdj = lvm disk
> ide5:	/dev/hdk = raid disk
> 	/dev/hdl = nothing
> ide6:	/dev/hdm = raid disk		(Third Promise card)
> 	/dev/hdn = nothing
> ide7:	/dev/hdo = nothing
> 	/dev/hdp = nothing
> 
> From what I understood, this is how you want to connect a set of raid drives
> so that no one controller is overloaded with IO.  But I had to use the
> other ports to connect the LVM.
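
For reference, the old IDE driver names drives by channel: the master and
slave on ideN get consecutive letters, so ide0 carries hda/hdb, ide1 carries
hdc/hdd, and so on.  A minimal Python sketch of that convention (the function
name is only illustrative, not anything from the kernel):

```python
import string


def ide_device_names(channel):
    """Return (master, slave) /dev/hdX names for legacy IDE channel ideN."""
    # 26 letters allow 13 channels (ide0..ide12) in this simple sketch.
    if not 0 <= channel <= 12:
        raise ValueError("channel must be between 0 and 12")
    letters = string.ascii_lowercase
    master = "/dev/hd" + letters[2 * channel]
    slave = "/dev/hd" + letters[2 * channel + 1]
    return master, slave


# Print the mapping for the eight channels used in the listing above.
for ch in range(8):
    master, slave = ide_device_names(ch)
    print(f"ide{ch}: master={master} slave={slave}")
```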
> 
> I started to get 'dma_timer_expiry' errors (see message file extract below):
> 
> Dec 22 21:29:33 livecd hdg: dma_timer_expiry: dma status == 0x21
> Dec 22 21:29:43 livecd hdg: DMA timeout error
> Dec 22 21:29:43 livecd hdg: dma timeout error: status=0x50 { DriveReady
> SeekComplete }
> Dec 22 21:29:43 livecd ide: failed opcode was: unknown
> Dec 22 21:29:43 livecd hdg: task_in_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Dec 22 21:29:43 livecd hdg: task_in_intr: error=0x04 { DriveStatusError }
> Dec 22 21:29:43 livecd ide: failed opcode was: unknown
> Dec 22 21:29:43 livecd hdg: task_in_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Dec 22 21:29:43 livecd hdg: task_in_intr: error=0x04 { DriveStatusError }
> Dec 22 21:29:43 livecd ide: failed opcode was: unknown
> Dec 22 21:29:43 livecd hdg: task_in_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Dec 22 21:29:43 livecd hdg: task_in_intr: error=0x04 { DriveStatusError }
> Dec 22 21:29:43 livecd ide: failed opcode was: unknown
> Dec 22 21:29:43 livecd hdg: task_in_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Dec 22 21:29:43 livecd hdg: task_in_intr: error=0x04 { DriveStatusError }
> Dec 22 21:29:43 livecd ide: failed opcode was: unknown
> Dec 22 21:29:43 livecd PDC202XX: Secondary channel reset.
> Dec 22 21:29:43 livecd ide3: reset: success
> Dec 22 21:30:03 livecd hdg: dma_timer_expiry: dma status == 0x21
> Dec 22 21:30:15 livecd hdg: DMA timeout error
> Dec 22 21:30:15 livecd hdg: dma timeout error: status=0x80 { Busy }
> Dec 22 21:30:15 livecd ide: failed opcode was: unknown
> Dec 22 21:30:15 livecd hdg: DMA disabled
> Dec 22 21:30:15 livecd PDC202XX: Secondary channel reset.
> Dec 22 21:30:20 livecd ide3: reset: success
> Dec 22 21:36:58 livecd hdg: irq timeout: status=0x80 { Busy }
> Dec 22 21:36:58 livecd ide: failed opcode was: unknown
> Dec 22 21:36:58 livecd PDC202XX: Secondary channel reset.
> Dec 22 21:37:33 livecd ide3: reset timed-out, status=0x80
> Dec 22 21:37:33 livecd hdg: status timeout: status=0x80 { Busy }
> Dec 22 21:37:33 livecd ide: failed opcode was: unknown
> Dec 22 21:37:33 livecd PDC202XX: Secondary channel reset.
> Dec 22 21:37:33 livecd hdg: drive not ready for command
> Dec 22 21:37:48 livecd ide3: reset: success
> Dec 22 21:37:58 livecd hdg: lost interrupt
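
The status= values in that log are the ATA status register, one flag per bit.
A small Python sketch for decoding them; the names for 0x50, 0x51 and 0x80
match what the driver printed above, while the names for the remaining bits
are my assumption based on the usual ATA bit layout:

```python
# (mask, name) pairs for the ATA status register, highest bit first.
# Busy, DriveReady, SeekComplete and Error appear in the log above;
# the other names are assumed from the standard ATA layout.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(value):
    """Return the names of all flags set in an ATA status byte."""
    return [name for mask, name in STATUS_BITS if value & mask]


for status in (0x50, 0x51, 0x80, 0xD0):
    print(f"status=0x{status:02x} {{ {' '.join(decode_status(status))} }}")
```

Read this way, the 0xd0 that libata reports later is Busy together with
DriveReady and SeekComplete, i.e. the device never drops Busy after the reset.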
> 
> These errors caused the raid array to crash repeatedly, so I gave up on that
> and changed the raid to an EVMS drive linked logical volume, and changed
> their connections to as follows:
> 
> ide0:	/dev/hda = System boot disk	(motherboard)
> 	/dev/hdb = DVD ROM
> ide1:	/dev/hdc = nothing
> 	/dev/hdd = nothing
> ide2:	/dev/hde = lvm1			(first promise card)
> 	/dev/hdf = lvm1
> ide3:	/dev/hdg = lvm1
> 	/dev/hdh = lvm1
> ide4:	/dev/hdi = lvm1			(second promise card)
> 	/dev/hdj = nothing
> ide5:	/dev/hdk = lvm2
> 	/dev/hdl = lvm2
> ide6:	/dev/hdm = lvm2			(third promise card)
> 	/dev/hdn = nothing
> ide7:	/dev/hdo = nothing
> 	/dev/hdp = nothing
> 
> Still got the same dma_timer_expiry errors. I consulted this list as to how
> to resolve them.  The wisdom of the list recommended that I try libata
> rather than the old ide controller code.  So I patched the kernel, and all
> was well, for quite some time.
> 
> But then I started to get random lockups.  I upgraded the kernel to 2.6.19,
> which has all the libata code in it, and ran it.  This didn't help.  I
> enabled nmi_watchdog in order to track down which drive was causing the
> problems.  It helped, and pointed to a drive (see message log file extract
> below):
> 
> Dec 25 03:13:23 storage ATA: abnormal status 0x80 on port 0xE0A817DF
> Dec 25 03:13:23 storage ATA: abnormal status 0x80 on port 0xE0A817DF
> Dec 25 03:13:23 storage ATA: abnormal status 0x80 on port 0xE0A817DF
> Dec 25 03:13:53 storage ata5.01: exception Emask 0x0 SAct 0x0 SErr 0x0
> action 0x2 frozen
> Dec 25 03:13:53 storage ata5.01: (BMDMA stat 0x1)
> Dec 25 03:13:53 storage ata5.01: tag 0 cmd 0xc8 Emask 0x4 stat 0x40 err 0x0
> (timeout)
> Dec 25 03:14:00 storage ata5: port is slow to respond, please be patient
> (Status 0x80)
> Dec 25 03:14:23 storage ata5: port failed to respond (30 secs, Status 0x80)
> Dec 25 03:14:23 storage ata5: soft resetting port
> Dec 25 03:14:30 storage ata5: port is slow to respond, please be patient
> (Status 0xd0)
> Dec 25 03:14:53 storage ata5: port failed to respond (30 secs, Status 0xd0)
> Dec 25 03:14:53 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:14:53 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:14:53 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:14:53 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:14:53 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:14:53 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:15:23 storage ata5.00: qc timeout (cmd 0xec)
> Dec 25 03:15:23 storage ata5.00: failed to IDENTIFY (I/O error,
> err_mask=0x4)
> Dec 25 03:15:23 storage ata5.00: revalidation failed (errno=-5)
> Dec 25 03:15:23 storage ata5: failed to recover some devices, retrying in 5
> secs
> Dec 25 03:15:28 storage ata5: soft resetting port
> Dec 25 03:15:35 storage ata5: port is slow to respond, please be patient
> (Status 0xd0)
> Dec 25 03:15:58 storage ata5: port failed to respond (30 secs, Status 0xd0)
> Dec 25 03:15:58 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:15:58 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:15:58 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:15:58 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:15:58 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:15:58 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:16:28 storage ata5.00: qc timeout (cmd 0xec)
> Dec 25 03:16:28 storage ata5.00: failed to IDENTIFY (I/O error,
> err_mask=0x4)
> Dec 25 03:16:28 storage ata5.00: revalidation failed (errno=-5)
> Dec 25 03:16:28 storage ata5: failed to recover some devices, retrying in 5
> secs
> Dec 25 03:16:33 storage ata5: soft resetting port
> Dec 25 03:16:41 storage ata5: port is slow to respond, please be patient
> (Status 0xd0)
> Dec 25 03:17:04 storage ata5: port failed to respond (30 secs, Status 0xd0)
> Dec 25 03:17:04 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:17:04 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:17:04 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:17:04 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:17:04 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:17:04 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:17:34 storage ata5.00: qc timeout (cmd 0xec)
> Dec 25 03:17:34 storage ata5.00: failed to IDENTIFY (I/O error,
> err_mask=0x4)
> Dec 25 03:17:34 storage ata5.00: revalidation failed (errno=-5)
> Dec 25 03:17:34 storage ata5.00: disabled
> Dec 25 03:17:34 storage ata5: failed to recover some devices, retrying in 5
> secs
> Dec 25 03:17:39 storage ATA: abnormal status 0x80 on port 0xE0A817DF
> Dec 25 03:17:39 storage ata5.01: failed to IDENTIFY (I/O error,
> err_mask=0x40)
> Dec 25 03:17:39 storage ata5.01: revalidation failed (errno=-5)
> Dec 25 03:17:39 storage ata5: failed to recover some devices, retrying in 5
> secs
> Dec 25 03:17:51 storage ata5: port is slow to respond, please be patient
> (Status 0x80)
> Dec 25 03:18:14 storage ata5: port failed to respond (30 secs, Status 0x80)
> Dec 25 03:18:14 storage ata5: soft resetting port
> Dec 25 03:18:21 storage ata5: port is slow to respond, please be patient
> (Status 0xd0)
> Dec 25 03:18:44 storage ata5: port failed to respond (30 secs, Status 0xd0)
> Dec 25 03:18:44 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:18:44 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:18:44 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:18:44 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:18:44 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:18:44 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:19:14 storage ata5.01: qc timeout (cmd 0xec)
> Dec 25 03:19:14 storage ata5.01: failed to IDENTIFY (I/O error,
> err_mask=0x4)
> Dec 25 03:19:14 storage ata5.01: revalidation failed (errno=-5)
> Dec 25 03:19:14 storage ata5: failed to recover some devices, retrying in 5
> secs
> Dec 25 03:19:19 storage ata5: soft resetting port
> Dec 25 03:19:26 storage ata5: port is slow to respond, please be patient
> (Status 0xd0)
> Dec 25 03:19:49 storage ata5: port failed to respond (30 secs, Status 0xd0)
> Dec 25 03:19:49 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:19:49 storage ATA: abnormal status 0xD2 on port 0xE0A817DF
> Dec 25 03:19:49 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:19:49 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:19:49 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:19:49 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
> Dec 25 03:20:19 storage ata5.01: qc timeout (cmd 0xec)
> Dec 25 03:20:19 storage ata5.01: failed to IDENTIFY (I/O error,
> err_mask=0x4)
> Dec 25 03:20:19 storage ata5.01: revalidation failed (errno=-5)
> Dec 25 03:20:19 storage ata5.01: disabled
> Dec 25 03:20:20 storage ata5: EH complete
> Dec 25 03:20:20 storage sd 4:0:1:0: SCSI error: return code = 0x00040000
> Dec 25 03:20:20 storage end_request: I/O error, dev sdg, sector 271144
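
With logs this long it can help to tally messages per device before deciding
which port to blame.  A rough Python sketch, assuming syslog lines shaped like
the ones above; the regular expression and function name are mine, and the
sample lines are copied from your two extracts:

```python
import re
from collections import Counter

# Matches the device tag that precedes the message: ata5, ata5.01, hdg, ide3.
DEVICE_RE = re.compile(r"\b(ata\d+(?:\.\d+)?|hd[a-z]|ide\d+):")


def count_by_device(lines):
    """Count log lines per device tag; lines without a tag are ignored."""
    counts = Counter()
    for line in lines:
        match = DEVICE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "Dec 25 03:13:53 storage ata5.01: (BMDMA stat 0x1)",
    "Dec 25 03:14:23 storage ata5: soft resetting port",
    "Dec 25 03:15:23 storage ata5.00: qc timeout (cmd 0xec)",
    "Dec 25 03:17:34 storage ata5.00: disabled",
    "Dec 22 21:29:43 livecd hdg: DMA timeout error",
]

for device, count in count_by_device(sample).most_common():
    print(device, count)
```

In the extract above both ata5.00 and ata5.01 end up disabled, which suggests
the whole channel rather than a single drive.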
> 
> However, when I take the drives off that system, put them on another
> machine's motherboard IDE connection, and run badblocks in read-only mode on
> all the drives, I get no errors, not a single one on any of the drives.  So
> if it's not the physical IO that's causing problems, it must be the interface?
> 
> Something is clearly wrong here, and I'm at a loss as to what it may be or
> how to resolve it.
> 
> 1). Could it be that this is just too many PCI IDE drive controllers in one
> system?  That they are fighting each other for resources when they are all
> being read from and written to?
> 
> 2). If it is in fact that there are too many IDE controllers, is there any
> advantage in a single board with many drive connections over many boards?
> 
> 3). Are there any BIOS or kernel settings that I could make that would
> resolve what may be this resource contention?  A slower CPU? (I have a
> similar 900 MHz Athlon system and file serving is hardly CPU intensive).
> 
> 4). Is it that the Promise 20269 is a bad choice of controller and I should
> change to a different one?  If so, what is a good choice?
> 
> 5). Should I consider migrating everything to SATA with SIL 3114
> controllers?
> 
> 6). Should I consider reducing the number of smaller disks in favor of fewer
> larger ones?
> 
> 7). What if I add a SIL 3114 SATA controller and SATA disks to migrate off,
> will I cause the same issue by adding yet another PCI hard disk controller?
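
On the contention questions: your lspci output puts all three Promise cards on
the same bus (01), so they share whatever that bus can carry.  A
back-of-the-envelope Python sketch, assuming a conventional 32-bit, 33 MHz PCI
bus (the listing does not state width or clock, so both figures are
assumptions) and an even split between active drives:

```python
def pci_peak_mb_per_s(bus_width_bits=32, clock_mhz=33.33):
    """Theoretical peak of a PCI bus in MB/s: bytes per clock times MHz."""
    return bus_width_bits / 8 * clock_mhz


def per_drive_share(peak_mb_per_s, active_drives):
    """Even split of the bus peak across drives transferring at once."""
    return peak_mb_per_s / active_drives


peak = pci_peak_mb_per_s()
for drives in (1, 5, 8):
    share = per_drive_share(peak, drives)
    print(f"{drives} active drive(s): about {share:.1f} MB/s each at best")
```

This only bounds throughput; real numbers will be lower, and a saturated bus
by itself would not necessarily explain the timeouts and resets in the logs.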
> 
> Lspci output for further reference:
> 00:00.0 Host bridge: nVidia Corporation nForce2 AGP (different version?)
> (rev c1)
> 00:00.1 RAM memory: nVidia Corporation nForce2 Memory Controller 0 (rev c1)
> 00:00.2 RAM memory: nVidia Corporation nForce2 Memory Controller 4 (rev c1)
> 00:00.3 RAM memory: nVidia Corporation nForce2 Memory Controller 3 (rev c1)
> 00:00.4 RAM memory: nVidia Corporation nForce2 Memory Controller 2 (rev c1)
> 00:00.5 RAM memory: nVidia Corporation nForce2 Memory Controller 5 (rev c1)
> 00:01.0 ISA bridge: nVidia Corporation nForce2 ISA Bridge (rev a4)
> 00:01.1 SMBus: nVidia Corporation nForce2 SMBus (MCP) (rev a2)
> 00:08.0 PCI bridge: nVidia Corporation nForce2 External PCI Bridge (rev a3)
> 00:09.0 IDE interface: nVidia Corporation nForce2 IDE (rev a2)
> 00:1e.0 PCI bridge: nVidia Corporation nForce2 AGP (rev c1)
> 01:06.0 Mass storage controller: Promise Technology, Inc. 20269 (rev 02)
> 01:07.0 Mass storage controller: Promise Technology, Inc. 20269 (rev 02)
> 01:08.0 Mass storage controller: Promise Technology, Inc. 20269 (rev 02)
> 01:0a.0 Ethernet controller: Accton Technology Corporation SMC2-1211TX (rev
> 10)
> 02:00.0 VGA compatible controller: ATI Technologies Inc Radeon R200 QM
> [Radeon 9100]
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
