I have been debugging a power management problem for a few days now, and
I believe I have finally solved it. Because the fix involved patching
the kernel, I felt I should share it here in the hope that it can be
improved and/or integrated into future kernels. I am currently running
a self-compiled 2.6.14.2 kernel on amd64 with the Ubuntu Breezy amd64
distribution.
First I'll state the fix. It involved changing two lines in
include/linux/libata.h:
static inline u8 ata_busy_wait(struct ata_port *ap, unsigned int bits,
                               unsigned int max)
{
        u8 status;

        do {
                udelay(100);    /* changed from 10 to 100 */
                status = ata_chk_status(ap);
                max--;
        } while ((status & bits) && (max > 0));

        return status;
}
and:
static inline u8 ata_wait_idle(struct ata_port *ap)
{
        u8 status = ata_busy_wait(ap, ATA_BUSY | ATA_DRQ,
                                  10000);       /* changed from 1000 to 10000 */

        if (status & (ATA_BUSY | ATA_DRQ)) {
                unsigned long l = ap->ioaddr.status_addr;
                printk(KERN_WARNING
                       "ATA: abnormal status 0x%X on port 0x%lX\n",
                       status, l);
        }

        return status;
}
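For clarity, ata_wait_idle() is what passes max = 10000 into
ata_busy_wait(), so the total busy-wait budget is the per-iteration
udelay multiplied by the iteration count. A small userspace sketch
(plain C, not kernel code; the values are taken from the stock and
patched lines above) makes the arithmetic explicit:

#include <stdio.h>

/* Total busy-wait budget = per-iteration udelay x iteration count.
 * Stock:   udelay(10)  x  1,000 iterations
 * Patched: udelay(100) x 10,000 iterations */
int main(void)
{
        unsigned long stock_us   = 10UL * 1000;
        unsigned long patched_us = 100UL * 10000;

        printf("stock budget:   %lu us (%lu ms)\n", stock_us, stock_us / 1000);
        printf("patched budget: %lu us (%lu ms)\n", patched_us, patched_us / 1000);
        printf("increase:       %lux\n", patched_us / stock_us);
        return 0;
}

The stock budget works out to 10 ms, while the patched budget is a full
second, which matches the factor-of-100 increase described below.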
The problem seems to be that my VIA SATA RAID controller requires more
time to recover from being suspended. It looks like the code in
sata_via.c restores the task file after a resume, then calls
ata_wait_idle to wait for the busy bit to clear. The trouble was that
this function timed out before the busy bit cleared, resulting in
messages like this:
ATA: abnormal status 0x80 on port 0xE007
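Schematically, the resume path described above looks something like the
following. This is a sketch only, not the actual sata_via.c code;
restore_taskfile() is a hypothetical stand-in for the driver's taskfile
restore step:

/* Sketch of the resume sequence described above; NOT the actual
 * sata_via.c code.  restore_taskfile() is a hypothetical stand-in. */
static void sketch_port_resume(struct ata_port *ap)
{
        restore_taskfile(ap);   /* reprogram the drive's taskfile registers */

        /* Wait for BSY/DRQ to clear before new commands are issued.
         * If this times out, the "abnormal status" warning is printed
         * even though the port is not actually ready yet. */
        ata_wait_idle(ap);
}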
Then if an IO request was made immediately after resuming, it would
time out and fail, because it was issued before the hardware was ready.
Changing the timeout resolved this. I tried increasing the timeout via
both the udelay value and the ata_busy_wait iteration count, and it did
not seem to matter which I changed, as long as the total timeout was
increased by a factor of 100.
Since increasing the maximum timeout, suspend and hibernate have worked
great for me. While chasing this bug, I may have exposed another one,
which I will mention now in passing. As I said before, if an IO request
was made immediately after a resume (before the busy bit finally did
clear), it would time out and fail. The kernel then appeared to fill
the buffer cache for the requested block with garbage rather than
retrying the read. It seems to me that at some point the read should
have been retried. The symptoms of this were:
1) When suspend.sh called resume.sh immediately after the echo mem >
/sys/power/state line, the read on resume would fail in a block of the
reiserfs tree that was required to look up the resume.sh file. This
caused reiserfs to complain about errors in the node, and the script
failed to execute. Further attempts to touch the script, and even ls
-al /etc/acpi/resume.sh, failed with EPERM. I would think that at worst
this should fail with EIO or something, not EPERM.
2) At one point I tried running echo mem > /sys/power/state ; df. After
the resume, the IO read failed while trying to load df, and I got an
error message saying the kernel could not execute the binary. Further
attempts to run df also failed, while other IO at that point was fine.
This leads me to think that when the IO failed, rather than informing
the calling code of the failure (for example with an EIO status), the
kernel filled the buffer cache with junk, and this should not happen.
Either the operation should succeed and the correct data be returned,
or it should fail and the caller be informed of the failure rather than
given incorrect data.
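To illustrate the behaviour I would expect, here is a minimal sketch of
a buffer_head-style read-completion handler. This is illustrative only,
under my assumptions about the completion path, and is not taken from
the kernel source:

#include <linux/buffer_head.h>

/* Illustrative only: propagate a read failure instead of exposing
 * whatever happens to be in the buffer.  A caller that finds the
 * buffer not uptodate should return -EIO rather than junk data. */
static void sketch_end_read(struct buffer_head *bh, int uptodate)
{
        if (uptodate)
                set_buffer_uptodate(bh);        /* data is valid */
        else
                clear_buffer_uptodate(bh);      /* never hand out junk */
        unlock_buffer(bh);
}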
When the first IO immediately following the resume failed, I got these
messages:
[ 32.013538] ata1: command 0x35 timeout, stat 0x50 host_stat 0x1
[ 32.045510] ata2: command 0x35 timeout, stat 0x50 host_stat 0x1
As long as no IO was requested immediately after the resume (i.e. if I
ran echo mem > /sys/power/state on an otherwise idle system rather than
using suspend.sh), these errors did not happen; only the abnormal
status messages did.
For reference, my system is configured as follows:
Motherboard: Asus K8V Deluxe
CPU: AMD Athlon 64 3200+
RAM: 1 GB of Corsair low-latency PC3200 DDR SDRAM
Video: ATI Radeon 9800 Pro with a Samsung 930B 19-inch LCD display
Disks: 2 WD 36 GB SATA 10,000 RPM Raptors in a RAID 0 configuration on
the VIA SATA RAID controller
Partitions:
/dev/mapper/via_hfciifae1: 40 GB WinXP NTFS partition
/dev/mapper/via_hfciifae3: 10 GB experimental partition
/dev/mapper/via_hfciifae5: 50 MB ext2 /boot partition
/dev/mapper/via_hfciifae6: 1 GB swap partition
/dev/mapper/via_hfciifae7: 22 GB reiserfs root partition
If anyone has suggestions for further tests I can perform to narrow
down the problem, or a better solution for it, you have my full
cooperation. If this fix seems acceptable, I hope it can be merged into
a future kernel release.
PS: Please CC me on any replies, as I am not subscribed to this list.