Re: 2.6.15.1: persistent nasty hang in sync_page killing NFS (ne2k-pci / DP83815-related?), i686/PIII

On Mon, 2006-01-30 at 16:55 +0000, Nix wrote:
> On Sun, 29 Jan 2006, Trond Myklebust stipulated:
> > As a general rule of thumb: if tcpdump/ethereal can see the reply on the
> > client, then the engine socket should see it too. If tcpdump is indeed
> > seeing those replies, you should check the RPC code by
> > setting /proc/sys/sunrpc/rpc_debug to 1.
> 
> tcpdump is seeing them.

The complete packets, including all fragments?


> ... have a pile of messages in the midst of a locked-up transfer:
> 
> Jan 30 16:50:57 loki warning: kernel: -pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- --exit--
> Jan 30 16:50:57 loki warning: kernel: 15046 0006 0021 -00011 c1a11600 100003 c801f000 00000000 xprt_resend c02c3798 c01c0e4d
> Jan 30 16:50:57 loki warning: kernel: 15047 0006 0021 000000 c1a11600 100003 c801f0b8 00000070 xprt_pending c02c3869 c01c0e4d
> Jan 30 16:50:57 loki warning: kernel: 15048 0006 0021 -00011 c1a11600 100003 c801f170 00000000 xprt_resend c02c3798 c01c0e4d
> Jan 30 16:50:57 loki warning: kernel: 15049 0006 0001 -00011 c1a11600 100003 c801f228 00000000 xprt_sending c02c3798 c01c0e4d
> Jan 30 16:50:57 loki warning: kernel: RPC: 15047 xprt_timer
> Jan 30 16:50:57 loki warning: kernel: RPC:      cong 256, cwnd was 256, now 256
> Jan 30 16:50:57 loki warning: kernel: RPC: 15048 xprt_cwnd_limited cong = 0 cwnd = 256
> Jan 30 16:50:57 loki warning: kernel: RPC: 15048 xprt_prepare_transmit
> Jan 30 16:50:57 loki warning: kernel: RPC: 15048 xprt_transmit(116)
> Jan 30 16:50:57 loki warning: kernel: RPC: 15048 xmit complete
> Jan 30 16:50:57 loki warning: kernel: RPC: 15047 xprt_prepare_transmit
> Jan 30 16:50:57 loki warning: kernel: RPC: 15047 xprt_cwnd_limited cong = 256 cwnd = 256
> Jan 30 16:50:57 loki warning: kernel: RPC: 15047 failed to lock transport c1a11800
> Jan 30 16:50:58 loki warning: kernel: RPC: 15048 xprt_timer
> Jan 30 16:50:58 loki warning: kernel: RPC:      cong 256, cwnd was 256, now 256
> Jan 30 16:50:58 loki warning: kernel: RPC: 15046 xprt_cwnd_limited cong = 0 cwnd = 256
> Jan 30 16:50:58 loki warning: kernel: RPC: 15046 xprt_prepare_transmit
> Jan 30 16:50:58 loki warning: kernel: RPC: 15046 xprt_transmit(116)
> Jan 30 16:50:58 loki warning: kernel: RPC: 15046 xmit complete
> Jan 30 16:50:58 loki warning: kernel: RPC: 15048 xprt_prepare_transmit
> Jan 30 16:50:58 loki warning: kernel: RPC: 15048 xprt_cwnd_limited cong = 256 cwnd = 256
> Jan 30 16:50:58 loki warning: kernel: RPC: 15048 failed to lock transport c1a11800
> Jan 30 16:50:59 loki warning: kernel: RPC: 15046 xprt_timer
> Jan 30 16:50:59 loki warning: kernel: RPC:      cong 256, cwnd was 256, now 256
> Jan 30 16:50:59 loki warning: kernel: RPC: 15047 xprt_cwnd_limited cong = 0 cwnd = 256
> Jan 30 16:50:59 loki warning: kernel: RPC: 15047 xprt_prepare_transmit
> Jan 30 16:50:59 loki warning: kernel: RPC: 15047 xprt_transmit(116)
> Jan 30 16:50:59 loki warning: kernel: RPC: 15047 xmit complete
> Jan 30 16:50:59 loki warning: kernel: RPC: 15046 xprt_prepare_transmit
> Jan 30 16:50:59 loki warning: kernel: RPC: 15046 xprt_cwnd_limited cong = 256 cwnd = 256
> Jan 30 16:50:59 loki warning: kernel: RPC: 15046 failed to lock transport c1a11800
> Jan 30 16:51:00 loki warning: kernel: RPC: 15047 xprt_timer
> [repeats indefinitely]
> Jan 30 16:51:38 loki warning: kernel: -pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- --exit--
> Jan 30 16:51:38 loki warning: kernel: 15046 0006 0021 -00011 c1a11600 100003 c801f000 00000000 xprt_resend c02c3798 c01c0e4d
> Jan 30 16:51:38 loki warning: kernel: 15047 0006 0021 000000 c1a11600 100003 c801f0b8 00000140 xprt_pending c02c3869 c01c0e4d
> Jan 30 16:51:38 loki warning: kernel: 15048 0006 0021 -00011 c1a11600 100003 c801f170 00000000 xprt_resend c02c3798 c01c0e4d
> 
> The RPC messages are emitted at pretty much exactly the same frequency
> as the ACKs.
> 
> I *guess* that the `failed to lock transport' is the underlying error...
> time to add some debugging and find out what task is locking the
> transport. Back soon, must rebuild the kernel and reboot to clear this
> lock ;)

Nope. The congestion window is 1 request, and you do indeed appear to
have one request on the "pending" queue. The problem in the above trace
is a complete lack of data_ready messages, meaning that the socket is
never seeing any complete packets come in.

Cheers,
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Follow-Ups:
- Re: 2.6.15.1: UDP fragments >27208 bytes lost with ne2k-pci on DP83815 (was Re: persistent nasty hang in sync_page killing NFS (ne2k-pci / DP83815-related?), i686/PIII)
  - From: Nix <nix@esperi.org.uk>

References:
- 2.6.15.1: persistent nasty hang in sync_page killing NFS (ne2k-pci / DP83815-related?), i686/PIII
  - From: Nix <nix@esperi.org.uk>
- Re: 2.6.15.1: persistent nasty hang in sync_page killing NFS (ne2k-pci / DP83815-related?), i686/PIII
  - From: Trond Myklebust <trond.myklebust@fys.uio.no>
- Re: 2.6.15.1: persistent nasty hang in sync_page killing NFS (ne2k-pci / DP83815-related?), i686/PIII
  - From: Nix <nix@esperi.org.uk>
- Re: 2.6.15.1: persistent nasty hang in sync_page killing NFS (ne2k-pci / DP83815-related?), i686/PIII
  - From: Nix <nix@esperi.org.uk>
- Re: 2.6.15.1: persistent nasty hang in sync_page killing NFS (ne2k-pci / DP83815-related?), i686/PIII
  - From: Trond Myklebust <trond.myklebust@fys.uio.no>
- Re: 2.6.15.1: persistent nasty hang in sync_page killing NFS (ne2k-pci / DP83815-related?), i686/PIII
  - From: Nix <nix@esperi.org.uk>
- Re: 2.6.15.1: persistent nasty hang in sync_page killing NFS (ne2k-pci / DP83815-related?), i686/PIII
  - From: Trond Myklebust <trond.myklebust@fys.uio.no>
- Re: 2.6.15.1: persistent nasty hang in sync_page killing NFS (ne2k-pci / DP83815-related?), i686/PIII
  - From: Nix <nix@esperi.org.uk>

Prev by Date: Re: CD writing in future Linux try #2 [ was: Re: CD writing in future Linux (stirring up a hornets' nest) ]
Next by Date: Re: 2.6.15-rt16
Previous by thread: Re: 2.6.15.1: persistent nasty hang in sync_page killing NFS (ne2k-pci / DP83815-related?), i686/PIII
Next by thread: Re: 2.6.15.1: UDP fragments >27208 bytes lost with ne2k-pci on DP83815 (was Re: persistent nasty hang in sync_page killing NFS (ne2k-pci / DP83815-related?), i686/PIII)
Index(es):
- Date
- Thread

[Index of Archives] [Kernel Newbies] [Netfilter] [Bugtraq] [Photo] [Stuff] [Gimp] [Yosemite News] [MIPS Linux] [ARM Linux] [Linux Security] [Linux RAID] [Video 4 Linux] [Linux for the blind] [Linux Resources]