Kernel 2.6.17-rc5 under amd64 may be hanging I/O, [Was] Re: [Q] What would cause fsck running on a drbd device to just stop?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



It appears that kernel 2.6.17-rc5 under amd64 may be hanging I/O
processes spontaneously at random.

Our setup here uses drbd (ver. 0.7.19), which is a network RAID
kernel module, and initially I was fscking (ext3) filesystems, which
are on drbd devices, and the fsck just stopped spontaneously on the
two of them.

Today, I tried copying a directory on the command line with cp -a on
this computer (on a drbd-managed device) and then in mid-copy I tried
to abort the process with control-C. It did not abort. I tried
killing it with kill and then with kill -9. It turns out that the
process had died, but is still left in ps.

Anyway, I tried bringing down the peer and bringing it back up and it stalled.

I also had an emerge sync (i.e., the Gentoo update mechanism) going
and it too got stuck, but it doesn't, or at least shouldn't affect
drbd disks, implying this is not a drbd bug, but a kernel bug.

* what are the numbers in /proc/drbd

This is how it appears long after the copy hang and I stopped and
just restarted drbd on the peer. The output from then hanging
computer:


version: 0.7.19 (api:78/proto:74)
SVN Revision: 2212 build by root@thewarehouse1, 2006-05-28 17:54:59
  0: cs:SyncSource st:Primary/Secondary ld:Consistent
     ns:844864 nr:0 dw:325684 dr:940453 al:200 bm:123 lo:0 pe:0 ua:13 ap:0
         [===================>] sync'ed:100.0% (52/52)K
         stalled
  1: cs:SyncSource st:Primary/Secondary ld:Consistent
ns:41686716 nr:0 dw:41166664 dr:57457996 al:76392 bm:94 lo:0 pe:0 ua:0 ap:0
         [===================>] sync'ed:100.0% (120/120)K
         stalled
  2: cs:SyncSource st:Primary/Secondary ld:Consistent
ns:45970636 nr:0 dw:45451468 dr:27302312 al:45571 bm:63 lo:0 pe:0 ua:0 ap:0
         [===================>] sync'ed:100.0% (1020/1020)K
         stalled
  3: cs:PausedSyncS st:Primary/Secondary ld:Consistent
     ns:284957716 nr:0 dw:284450048 dr:251121904 al:266659 bm:759 lo:0
pe:0 ua:0 ap:0
  4: cs:WFReportParams st:Primary/Unknown ld:Consistent
     ns:92534448 nr:0 dw:133197016 dr:64860944 al:157353 bm:21176
lo:19 pe:0 ua:0 ap:0
  5: cs:WFReportParams st:Primary/Unknown ld:Consistent
     ns:64605420 nr:0 dw:64085232 dr:54424576 al:132822 bm:121 lo:6
pe:0 ua:0 ap:0
  6: cs:Connected st:Primary/Secondary ld:Consistent
     ns:88661276 nr:0 dw:88141084 dr:43618112 al:123587 bm:122 lo:0
pe:0 ua:0 ap:0
  7: cs:PausedSyncS st:Primary/Secondary ld:Consistent
     ns:79213408 nr:0 dw:78693368 dr:42392908 al:184445 bm:128 lo:0
pe:0 ua:0 ap:0
  8: cs:WFReportParams st:Primary/Unknown ld:Consistent
     ns:114288948 nr:0 dw:114035536 dr:113012216 al:80028 bm:2102 lo:2
pe:0 ua:0 ap:0

* get the wchan of the stuck processes
   via top, or ps -eo pid,wchan:40,comm, or cat /proc/<pid>/wchan
     (not only the fsck, but everything that may be
     related, so fs-related kernel threads, vm-related
     kernel threads, drbd-related kernel-threads).


  PID WCHAN                                    COMMAND         STAT
     1 select                                   init            Ss
     2 migration_thread                         migration/0     S
     3 ksoftirqd                                ksoftirqd/0     SN
     4 watchdog                                 watchdog/0      S
     5 migration_thread                         migration/1     S
     6 ksoftirqd                                ksoftirqd/1     SN
     7 watchdog                                 watchdog/1      S
     8 worker_thread                            events/0        S<
     9 worker_thread                            events/1        S<
    10 worker_thread                            khelper         S<
    11 worker_thread                            kthread         S<
    16 worker_thread                            kblockd/0       S<
    17 worker_thread                            kblockd/1       S<
    18 serio_thread                             kseriod         S<
   108 drbd_al_begin_io                         kswapd0         D
   109 drbd_al_begin_io                         kswapd1         D
   110 worker_thread                            aio/0           S<
   111 worker_thread                            aio/1           S<
   746 scsi_error_handler                       scsi_eh_0       S<
   750 worker_thread                            ata/0           S<
   751 worker_thread                            ata/1           S<
   774 worker_thread                            kpsmoused       S<
  1380 worker_thread                            kmirrord        S<
  1918 scsi_error_handler                       scsi_eh_1       S<
  2136 kjournald                                kjournald       S<
  2801 select                                   udevd           S<s
  3842 kjournald                                kjournald       S<
  3844 kjournald                                kjournald       S<
  5638 sys_poll                                 metalog         Ss
  5639 syslog                                   metalog         S
  6597 select                                   sshd            Ss
  6767 sk_wait_data                             drbd0_receiver  S
  6769 sk_wait_data                             drbd1_receiver  S
  6771 sk_wait_data                             drbd7_receiver  S
  6773 sk_wait_data                             drbd2_receiver  S
  6775 sk_wait_data                             drbd3_receiver  S
  6777 drbd_determin_dev_size                   drbd8_receiver  D
  6779 drbd_determin_dev_size                   drbd4_receiver  D
  6781 drbd_determin_dev_size                   drbd5_receiver  D
  6783 sk_wait_data                             drbd6_receiver  S
  7013 sys_poll                                 heartbeat       SLs
  7035 pipe_wait                                heartbeat       SL
  7036 sys_poll                                 heartbeat       SL
  7037 skb_recv_datagram                        heartbeat       SL
  7184 wait                                     mon             S
  7327 skb_recv_datagram                        portsentry      Ss
  7329 skb_recv_datagram                        portsentry      Ss
  7444 select                                   master          Ss
  7459 select                                   qmgr            S
  7589 hrtimer_nanosleep                        cron            Ss
  7717 read_chan                                agetty          Ss+
  7718 read_chan                                agetty          Ss+
  7719 read_chan                                agetty          Ss+
  7720 read_chan                                agetty          Ss+
  7721 read_chan                                agetty          Ss+
  7722 read_chan                                agetty          Ss+
  8923 kjournald                                kjournald       S<
  8971 kjournald                                kjournald       S<
  9019 kjournald                                kjournald       S<
  9067 kjournald                                kjournald       S<
  9115 drbd_al_begin_io                         kjournald       D<
  9163 kjournald                                kjournald       S<
  9211 kjournald                                kjournald       S<
  9259 kjournald                                kjournald       S<
  9307 drbd_al_begin_io                         kjournald       D<
  9438 select                                   atalkd          S
  9544 select                                   smbd            Ss
  9548 pause                                    smbd            S
  9554 select                                   nmbd            Ss
  9688 select                                   mysqld          Ssl
  9901 select                                   cnid_metad      S
  9910 select                                   afpd            S
11332 compat_sys_nanosleep                     retroclient     Ss
11333 sys_poll                                 retroclient     S
11335 compat_sys_nanosleep                     retroclient     S
11336 inet_csk_accept                          retroclient     S
11337 skb_recv_datagram                        retroclient     S
11344 inet_csk_accept                          retroclient     S
11345 skb_recv_datagram                        retroclient     S
11352 inet_csk_accept                          retroclient     S
11353 skb_recv_datagram                        retroclient     S
11360 inet_csk_accept                          retroclient     S
11365 compat_sys_nanosleep                     retroclient     S
11908 select                                   cnid_dbd        S
  8270 select                                   smbd            S
10289 drbd_al_begin_io                         smbd            D
31891 select                                   smbd            S
29010 drbd_al_begin_io                         smbd            D
24530 select                                   smbd            S
32250 pdflush                                  pdflush         S
19751 drbd_al_begin_io                         pdflush         D
  3856 sk_wait_data                             afpd            S
  8543 select                                   smbd            S
11364 sk_wait_data                             afpd            S
27917 sk_wait_data                             afpd            S
27918 select                                   cnid_dbd        S
22538 select                                   smbd            S
12830 select                                   smbd            S
17962 select                                   smbd            S
23331 select                                   smbd            S
28361 select                                   smbd            S
28411 wait                                     bash            S
  3404 select                                   smbd            S
13660 select                                   smbd            S
23339 sk_wait_data                             afpd            S
27168 kjournald                                kjournald       S<
25449 select                                   smbd            S
28328 select                                   smbd            S
32490 select                                   apache2         Ss
24761 skb_recv_datagram                        apache2         S
24764 inet_csk_accept                          apache2         S
24765 inet_csk_accept                          apache2         S
24766 inet_csk_accept                          apache2         S
24767 inet_csk_accept                          apache2         S
24768 inet_csk_accept                          apache2         S
  2091 select                                   smbd            S
10197 select                                   smbd            S
10832 select                                   smbd            S
15513 select                                   smbd            S
19215 select                                   smbd            S
19585 select                                   smbd            S
19761 drbd_al_begin_io                         smbd            D
21619 inet_csk_accept                          apache2         S
21620 inet_csk_accept                          apache2         S
30680 select                                   smbd            S
31724 select                                   smbd            S
32105 sk_wait_data                             afpd            S
  8343 select                                   smbd            S
17100 sk_wait_data                             afpd            S
14374 compat_sys_nanosleep                     retroclient     S
14376 get_write_access                         retropds.23     D
14397 pipe_wait                                retroclient     S
27942 pipe_wait                                cron            S
27943 wait                                     bash            Ss
27944 wait                                     run-crons       S
27973 wait                                     hotbackup_daily S
27976 wait                                     hotbackup_daily S
28053 pipe_wait                                sendmail        S
28056 unix_stream_recvmsg                      postdrop        S
28058 sync_buffer                              mv              D
28509 get_write_access                         find            D
28512 pipe_wait                                sort            S
28514 pipe_wait                                sed             S
12803 wait                                     slocate         S
12804 vfs_readdir                              updatedb        DN
28845 drbd_al_begin_io                         smbd            D
  2145 drbd_al_begin_io                         smbd            D
  2885 select                                   smbd            S
  3571 select                                   smbd            S
  5313 select                                   smbd            S
  5417 drbd_al_begin_io                         smbd            D
  5444 drbd_al_begin_io                         mon             D
  5468 drbd_al_begin_io                         smbd            D
  6218 wait                                     bash            S
  6237 drbd_al_begin_io                         cp              D
  6332 wait                                     bash            S
  6340 vfs_unlink                               rm              D
  6390 select                                   smbd            S
  6499 vfs_readdir                              ls              D
  6595 sk_wait_data                             afpd            S
  6947 log_wait_commit                          cnid_dbd        D
  7084 sk_wait_data                             afpd            S
  7136 drbd_al_begin_io                         smbd            D
  7149 select                                   smbd            S
  7213 drbd_al_begin_io                         afpd            D
  7754 drbd_al_begin_io                         smbd            D
  7757 drbd_al_begin_io                         afpd            D
  7758 drbd_al_begin_io                         smbd            D
  7764 select                                   smbd            S
  7771 pipe_wait                                cron            S
  7772 wait                                     bash            Ss
  7773 drbd_al_begin_io                         run-crons       D
  7778 drbd_al_begin_io                         afpd            D
  8002 select                                   smbd            S
  8088 select                                   smbd            S
  8220 drbd_al_begin_io                         afpd            D
  8222 log_wait_for_space                       cnid_dbd        D
  8482 drbd_al_begin_io                         emerge          D
  9451 select                                   smbd            S
  9575 drbd_al_begin_io                         afpd            D
  9613 wait                                     mail.alert      S
  9614 unix_stream_recvmsg                      sendmail        S
  9616 drbd_al_begin_io                         postdrop        D
  9633 sk_wait_data                             afpd            S
  9697 select                                   pickup          S
  9918 drbd_al_begin_io                         emerge          D
  9924 drbd_al_begin_io                         sshd            D
  9936 unix_stream_recvmsg                      sshd            Ss
  9941 select                                   sshd            S
  9942 wait                                     bash            Ss
  9951 unix_stream_recvmsg                      sshd            Ss
  9956 select                                   sshd            S
  9957 wait                                     bash            Ss
  9962 read_chan                                bash            S+
10126 drbd_thread_setup                        drbd6_worker    S
10136 select                                   smbd            S
10142 sk_wait_data                             drbd6_asender   S
10143 sk_wait_data                             drbd0_asender   S
10144 sk_wait_data                             drbd2_asender   S
10145 sk_wait_data                             drbd1_asender   S
10146 sk_wait_data                             drbd3_asender   S
10147 sk_wait_data                             drbd7_asender   S
10163 -                                        ps              R+

* get a backtrace of the stuck processess
   e.g. echo t > /proc/sysrq-trigger
   (again, everything that may be related)
   note that these are not necessarily reliable,
  but may be helpful to understand whats going on.


The above and remaining portions of this post are available entirely online at http://www.kennedy.aecom.yu.edu/misc/kernel_2.6.17-rc5_drbd_hanging_io.html
--

Maurice Volaski, [email protected]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux