Andrew Morton wrote:
This change will increase the minimum delay in both ata_wait_idle() and
ata_busy_wait() from 10 usec to 100 usec, which is not a good change.
It would be less damaging to increase the delay in ata_wait_idle() from
1000 to 100,000. A one second spin is a bit sad, but the hardware's bust,
so that's the least of the user's worries.
The best fix would be to identify the specific _callers_ of these functions
which need their delay increased. You can do that by adding:
if (max == 0)
dump_stack();
at the end of ata_busy_wait(). The resulting stack dumps will tell you
where the offending callsites are. With luck, it's just one. If we know
which one, that might point us at some chipset-level delay which we're
forgetting to do, or something like that.
Right, changing ata_wait_idle's maximum to 100,000 would be a better fix
as it would not effect the minimum wait time. Thanks for the
dump_stack() tip, I put that in and this is the call stack dumped for
the two timeouts following a resume ( though without the stack dump,
there were 6 timeouts reported, 3 on each channel ):
[ 2.650217] Call Trace:<ffffffff88048782>{:libata:ata_dev_select+102}
<ffffffff880488fa>{:libata:ata_qc_issue_prot+31}
[ 2.650244] <ffffffff8804ba48>{:libata:ata_scsi_translate+172}
<ffffffff880244bd>{:scsi_mod:scsi_done+0}
[ 2.650269] <ffffffff880244bd>{:scsi_mod:scsi_done+0}
<ffffffff8804bd06>{:libata:ata_scsi_queuecmd+219}
[ 2.650291] <ffffffff88024686>{:scsi_mod:scsi_dispatch_cmd+433}
[ 2.650305] <ffffffff8802978e>{:scsi_mod:scsi_request_fn+611}
<ffffffff80148ea3>{sync_page+0}
[ 2.650326] <ffffffff80224858>{generic_unplug_device+7}
<ffffffff80224851>{generic_unplug_device+0}
[ 2.650334] <ffffffff8807c118>{:dm_mod:dm_table_unplug_all+49}
<ffffffff8807a0f1>{:dm_mod:dm_unplug_all+29}
[ 2.650354] <ffffffff8807c118>{:dm_mod:dm_table_unplug_all+49}
<ffffffff8807a0f1>{:dm_mod:dm_unplug_all+29}
[ 2.650372] <ffffffff80164a08>{block_sync_page+58}
<ffffffff80148ed6>{sync_page+51}
[ 2.650381] <ffffffff8028eb11>{__wait_on_bit_lock+55}
<ffffffff801490fd>{__lock_page+94}
[ 2.650392] <ffffffff8013d37f>{wake_bit_function+0}
<ffffffff80149374>{find_get_page+12}
[ 2.650402] <ffffffff8014a4de>{do_generic_mapping_read+472}
<ffffffff8014947f>{file_read_actor+0}
[ 2.650410] <ffffffff8014af0a>{__generic_file_aio_read+333}
<ffffffff8014b0c0>{generic_file_read+187}
[ 2.650433] <ffffffff8013d274>{autoremove_wake_function+0}
<ffffffff80163898>{vfs_read+200}
[ 2.650453] <ffffffff8016ba1f>{kernel_read+65}
<ffffffff8016d4c7>{do_execve+291}
[ 2.650464] <ffffffff8010cc31>{sys_execve+51}
<ffffffff80139bd8>{worker_thread+0}
[ 2.650475] <ffffffff8010ef34>{execve+100}
<ffffffff80139bd8>{worker_thread+0}
[ 2.650482] <ffffffff801397bb>{__call_usermodehelper+0}
<ffffffff80139abc>{____call_usermodehelper+134}
[ 2.650496] <ffffffff8010eec6>{child_rip+8}
<ffffffff80139bd8>{worker_thread+0}
[ 2.650504] <ffffffff801397bb>{__call_usermodehelper+0}
<ffffffff80139a36>{____call_usermodehelper+0}
[ 2.650514] <ffffffff8010eebe>{child_rip+0}
[ 2.693624] ATA: abnormal status 0x80 on port 0xE007
[ 2.693626]
[ 2.693627] Call
Trace:<ffffffff880488fa>{:libata:ata_qc_issue_prot+31}
<ffffffff8804ba48>{:libata:ata_scsi_translate+172}
[ 2.693646] <ffffffff880244bd>{:scsi_mod:scsi_done+0}
<ffffffff880244bd>{:scsi_mod:scsi_done+0}
[ 2.693670] <ffffffff8804bd06>{:libata:ata_scsi_queuecmd+219}
<ffffffff88024686>{:scsi_mod:scsi_dispatch_cmd+433}
[ 2.693694] <ffffffff8802978e>{:scsi_mod:scsi_request_fn+611}
<ffffffff80148ea3>{sync_page+0}
[ 2.693712] <ffffffff80224858>{generic_unplug_device+7}
<ffffffff80224851>{generic_unplug_device+0}
[ 2.693720] <ffffffff8807c118>{:dm_mod:dm_table_unplug_all+49}
<ffffffff8807a0f1>{:dm_mod:dm_unplug_all+29}
[ 2.693738] <ffffffff8807c118>{:dm_mod:dm_table_unplug_all+49}
<ffffffff8807a0f1>{:dm_mod:dm_unplug_all+29}
[ 2.693756] <ffffffff80164a08>{block_sync_page+58}
<ffffffff80148ed6>{sync_page+51}
[ 2.693765] <ffffffff8028eb11>{__wait_on_bit_lock+55}
<ffffffff801490fd>{__lock_page+94}
[ 2.693775] <ffffffff8013d37f>{wake_bit_function+0}
<ffffffff80149374>{find_get_page+12}
[ 2.693785] <ffffffff8014a4de>{do_generic_mapping_read+472}
<ffffffff8014947f>{file_read_actor+0}
[ 2.693793] <ffffffff8014af0a>{__generic_file_aio_read+333}
<ffffffff8014b0c0>{generic_file_read+187}
[ 2.693816] <ffffffff8013d274>{autoremove_wake_function+0}
<ffffffff80163898>{vfs_read+200}
[ 2.693835] <ffffffff8016ba1f>{kernel_read+65}
<ffffffff8016d4c7>{do_execve+291}
[ 2.693846] <ffffffff8010cc31>{sys_execve+51}
<ffffffff80139bd8>{worker_thread+0}
[ 2.693857] <ffffffff8010ef34>{execve+100}
<ffffffff80139bd8>{worker_thread+0}
[ 2.693864] <ffffffff801397bb>{__call_usermodehelper+0}
<ffffffff80139abc>{____call_usermodehelper+134}
[ 2.693878] <ffffffff8010eec6>{child_rip+8}
<ffffffff80139bd8>{worker_thread+0}
[ 2.693885] <ffffffff801397bb>{__call_usermodehelper+0}
<ffffffff80139a36>{____call_usermodehelper+0}
[ 2.693896] <ffffffff8010eebe>{child_rip+0}
Maybe a better fix would be to insert a delay after the resume but
before the driver even tries to access the hardware. Maybe somewhere in
a method in the DSDT?
Or should have returned an IO error.
Yes, this might be a driver bug. Again, if we know precisely which
callsite is experiencing the timeout then we're better situated to fix it.
I made the timeouts happen again by returning the timeouts to the way
they were before, and issuing the command:
echo mem > /sys/power/state ; df
After the abnormal status messages, around 30 seconds went by, then this:
[ 14.596710] ata1: command 0x25 timeout, stat 0x50 host_stat 0x1
[ 14.596715]
[ 14.596716] Call
Trace:<ffffffff880498e2>{:libata:ata_eng_timeout+496}
<ffffffff88027690>{:scsi_mod:scsi_error_handler+0}
[ 14.596752] <ffffffff8804af99>{:libata:ata_scsi_error+21}
<ffffffff88027706>{:scsi_mod:scsi_error_handler+118}
[ 14.596775] <ffffffff88027690>{:scsi_mod:scsi_error_handler+0}
<ffffffff88027690>{:scsi_mod:scsi_error_handler+0}
[ 14.596803] <ffffffff8013d0d8>{kthread+191}
<ffffffff80139bd8>{worker_thread+0}
[ 14.596814] <ffffffff8010eec6>{child_rip+8}
<ffffffff80139bd8>{worker_thread+0}
[ 14.596822] <ffffffff8013cf04>{keventd_create_kthread+0}
<ffffffff8013d019>{kthread+0}
[ 14.596833] <ffffffff8010eebe>{child_rip+0}
[ 14.597403] ReiserFS: warning: is_tree_node: node level 2295 does not
match to the expected one 1
[ 14.597408] ReiserFS: dm-5: warning: vs-5150: search_by_key: invalid
format found in block 32828. Fsck?
[ 14.597414] ReiserFS: dm-5: warning: vs-13070:
reiserfs_read_locked_inode: i/o failure occurred trying to find stat
data of [2 1455 0x0 SD]
Bash reports command not found when trying to run df even now as I write
this.
I also noticed that during the 30 seconds or so that the system was
trying to load df before libata timed out the request, the cpu was
pegged at 100% utilization.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[Index of Archives]
[Kernel Newbies]
[Netfilter]
[Bugtraq]
[Photo]
[Stuff]
[Gimp]
[Yosemite News]
[MIPS Linux]
[ARM Linux]
[Linux Security]
[Linux RAID]
[Video 4 Linux]
[Linux for the blind]
[Linux Resources]