Re: VIA SATA Raid needs a long time to recover from suspend

Andrew Morton wrote:

This change will increase the minimum delay in both ata_wait_idle() and
ata_busy_wait() from 10 usec to 100 usec, which is not a good change.

It would be less damaging to increase the delay in ata_wait_idle() from
1000 to 100,000.  A one second spin is a bit sad, but the hardware's bust,
so that's the least of the user's worries.

The best fix would be to identify the specific _callers_ of these functions
which need their delay increased.   You can do that by adding:

	if (max == 0)
		dump_stack();

at the end of ata_busy_wait().  The resulting stack dumps will tell you
where the offending callsites are.  With luck, it's just one.  If we know
which one, that might point us at some chipset-level delay which we're
forgetting to do, or something like that.

Right, changing ata_wait_idle's maximum to 100,000 would be a better fixas it would not effect the minimum wait time. Thanks for thedump_stack() tip, I put that in and this is the call stack dumped forthe two timeouts following a resume ( though without the stack dump,there were 6 timeouts reported, 3 on each channel ):

[ 2.650217] Call Trace:<ffffffff88048782>{:libata:ata_dev_select+102}<ffffffff880488fa>{:libata:ata_qc_issue_prot+31}[ 2.650244] <ffffffff8804ba48>{:libata:ata_scsi_translate+172}<ffffffff880244bd>{:scsi_mod:scsi_done+0}[ 2.650269] <ffffffff880244bd>{:scsi_mod:scsi_done+0}<ffffffff8804bd06>{:libata:ata_scsi_queuecmd+219}

[    2.650291]        <ffffffff88024686>{:scsi_mod:scsi_dispatch_cmd+433}

[ 2.650305] <ffffffff8802978e>{:scsi_mod:scsi_request_fn+611}<ffffffff80148ea3>{sync_page+0}[ 2.650326] <ffffffff80224858>{generic_unplug_device+7}<ffffffff80224851>{generic_unplug_device+0}[ 2.650334] <ffffffff8807c118>{:dm_mod:dm_table_unplug_all+49}<ffffffff8807a0f1>{:dm_mod:dm_unplug_all+29}[ 2.650354] <ffffffff8807c118>{:dm_mod:dm_table_unplug_all+49}<ffffffff8807a0f1>{:dm_mod:dm_unplug_all+29}[ 2.650372] <ffffffff80164a08>{block_sync_page+58}<ffffffff80148ed6>{sync_page+51}[ 2.650381] <ffffffff8028eb11>{__wait_on_bit_lock+55}<ffffffff801490fd>{__lock_page+94}[ 2.650392] <ffffffff8013d37f>{wake_bit_function+0}<ffffffff80149374>{find_get_page+12}[ 2.650402] <ffffffff8014a4de>{do_generic_mapping_read+472}<ffffffff8014947f>{file_read_actor+0}[ 2.650410] <ffffffff8014af0a>{__generic_file_aio_read+333}<ffffffff8014b0c0>{generic_file_read+187}[ 2.650433] <ffffffff8013d274>{autoremove_wake_function+0}<ffffffff80163898>{vfs_read+200}[ 2.650453] <ffffffff8016ba1f>{kernel_read+65}<ffffffff8016d4c7>{do_execve+291}[ 2.650464] <ffffffff8010cc31>{sys_execve+51}<ffffffff80139bd8>{worker_thread+0}[ 2.650475] <ffffffff8010ef34>{execve+100}<ffffffff80139bd8>{worker_thread+0}[ 2.650482] <ffffffff801397bb>{__call_usermodehelper+0}<ffffffff80139abc>{____call_usermodehelper+134}[ 2.650496] <ffffffff8010eec6>{child_rip+8}<ffffffff80139bd8>{worker_thread+0}[ 2.650504] <ffffffff801397bb>{__call_usermodehelper+0}<ffffffff80139a36>{____call_usermodehelper+0}

[    2.650514]        <ffffffff8010eebe>{child_rip+0}
[    2.693624] ATA: abnormal status 0x80 on port 0xE007
[    2.693626]

[ 2.693627] CallTrace:<ffffffff880488fa>{:libata:ata_qc_issue_prot+31}<ffffffff8804ba48>{:libata:ata_scsi_translate+172}[ 2.693646] <ffffffff880244bd>{:scsi_mod:scsi_done+0}<ffffffff880244bd>{:scsi_mod:scsi_done+0}[ 2.693670] <ffffffff8804bd06>{:libata:ata_scsi_queuecmd+219}<ffffffff88024686>{:scsi_mod:scsi_dispatch_cmd+433}[ 2.693694] <ffffffff8802978e>{:scsi_mod:scsi_request_fn+611}<ffffffff80148ea3>{sync_page+0}[ 2.693712] <ffffffff80224858>{generic_unplug_device+7}<ffffffff80224851>{generic_unplug_device+0}[ 2.693720] <ffffffff8807c118>{:dm_mod:dm_table_unplug_all+49}<ffffffff8807a0f1>{:dm_mod:dm_unplug_all+29}[ 2.693738] <ffffffff8807c118>{:dm_mod:dm_table_unplug_all+49}<ffffffff8807a0f1>{:dm_mod:dm_unplug_all+29}[ 2.693756] <ffffffff80164a08>{block_sync_page+58}<ffffffff80148ed6>{sync_page+51}[ 2.693765] <ffffffff8028eb11>{__wait_on_bit_lock+55}<ffffffff801490fd>{__lock_page+94}[ 2.693775] <ffffffff8013d37f>{wake_bit_function+0}<ffffffff80149374>{find_get_page+12}[ 2.693785] <ffffffff8014a4de>{do_generic_mapping_read+472}<ffffffff8014947f>{file_read_actor+0}[ 2.693793] <ffffffff8014af0a>{__generic_file_aio_read+333}<ffffffff8014b0c0>{generic_file_read+187}[ 2.693816] <ffffffff8013d274>{autoremove_wake_function+0}<ffffffff80163898>{vfs_read+200}[ 2.693835] <ffffffff8016ba1f>{kernel_read+65}<ffffffff8016d4c7>{do_execve+291}[ 2.693846] <ffffffff8010cc31>{sys_execve+51}<ffffffff80139bd8>{worker_thread+0}[ 2.693857] <ffffffff8010ef34>{execve+100}<ffffffff80139bd8>{worker_thread+0}[ 2.693864] <ffffffff801397bb>{__call_usermodehelper+0}<ffffffff80139abc>{____call_usermodehelper+134}[ 2.693878] <ffffffff8010eec6>{child_rip+8}<ffffffff80139bd8>{worker_thread+0}[ 2.693885] <ffffffff801397bb>{__call_usermodehelper+0}<ffffffff80139a36>{____call_usermodehelper+0}

[    2.693896]        <ffffffff8010eebe>{child_rip+0}

Maybe a better fix would be to insert a delay after the resume butbefore the driver even tries to access the hardware. Maybe somewhere ina method in the DSDT?

Or should have returned an IO error.

Yes, this might be a driver bug.  Again, if we know precisely which
callsite is experiencing the timeout then we're better situated to fix it.

I made the timeouts happen again by returning the timeouts to the waythey were before, and issuing the command:

echo mem > /sys/power/state ; df

After the abnormal status messages, around 30 seconds went by, then this:

[   14.596710] ata1: command 0x25 timeout, stat 0x50 host_stat 0x1
[   14.596715]

[ 14.596716] CallTrace:<ffffffff880498e2>{:libata:ata_eng_timeout+496}<ffffffff88027690>{:scsi_mod:scsi_error_handler+0}[ 14.596752] <ffffffff8804af99>{:libata:ata_scsi_error+21}<ffffffff88027706>{:scsi_mod:scsi_error_handler+118}[ 14.596775] <ffffffff88027690>{:scsi_mod:scsi_error_handler+0}<ffffffff88027690>{:scsi_mod:scsi_error_handler+0}[ 14.596803] <ffffffff8013d0d8>{kthread+191}<ffffffff80139bd8>{worker_thread+0}[ 14.596814] <ffffffff8010eec6>{child_rip+8}<ffffffff80139bd8>{worker_thread+0}[ 14.596822] <ffffffff8013cf04>{keventd_create_kthread+0}<ffffffff8013d019>{kthread+0}

[   14.596833]        <ffffffff8010eebe>{child_rip+0}

[ 14.597403] ReiserFS: warning: is_tree_node: node level 2295 does notmatch to the expected one 1[ 14.597408] ReiserFS: dm-5: warning: vs-5150: search_by_key: invalidformat found in block 32828. Fsck?[ 14.597414] ReiserFS: dm-5: warning: vs-13070:reiserfs_read_locked_inode: i/o failure occurred trying to find statdata of [2 1455 0x0 SD]

Bash reports command not found when trying to run df even now as I writethis.I also noticed that during the 30 seconds or so that the system wastrying to load df before libata timed out the request, the cpu waspegged at 100% utilization.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Follow-Ups:
- Re: VIA SATA Raid needs a long time to recover from suspend
  - From: Mark Lord <lkml@rtr.ca>

References:
- VIA SATA Raid needs a long time to recover from suspend
  - From: Phillip Susi <psusi@cfl.rr.com>
- Re: VIA SATA Raid needs a long time to recover from suspend
  - From: Andrew Morton <akpm@osdl.org>

Prev by Date: Re: 2.6.15-rc1-git4 build failure on ppc64
Next by Date: Problems with fcc_enet driver for MPC8248 (2.6.12)
Previous by thread: Re: VIA SATA Raid needs a long time to recover from suspend
Next by thread: Re: VIA SATA Raid needs a long time to recover from suspend
Index(es):
- Date
- Thread

[Index of Archives] [Kernel Newbies] [Netfilter] [Bugtraq] [Photo] [Stuff] [Gimp] [Yosemite News] [MIPS Linux] [ARM Linux] [Linux Security] [Linux RAID] [Video 4 Linux] [Linux for the blind] [Linux Resources]