Thank you for these thoughts, Willy. Because the affected machine is a
server in another state I first tried going ahead and upgrading to 2.6.17.3.
The machine has now been up for 91 days and counting; in the year since
the mentioned change that brought on the segfaults I do not believe it
had ever ran that long. So it seems 2.4.32 to 2.6.17.3 either fixed
or happened to sidestep the issue. If I had another machine with the
problem I would be happy to investigate further, but, unfortunately, this
machine needs to stay up.
I just wanted to let possible others with similar problems know and
thank you and Grant Coady for your feedback!
On Sun, Sep 17, 2006 at 07:50:32AM +0200, Willy Tarreau wrote:
> On Sat, Sep 16, 2006 at 04:24:02PM -0700, Chris Frost wrote:
> > [1.] One line summary of the problem:
> > 2.4.32 proc_pid_stat() repeatedly segfaults.
> >
> > [2.] Full description of the problem/report:
> > 2.4.32 kernel, after being up for a few days to a few weeks, repeatedly
> > segfaults in proc_pid_stat(), triggered by w, ps, and other programs.
> >
> > [3.] Keywords (i.e., modules, networking, kernel):
> > kernel proc_pid_stat()
> >
> > [4.] Kernel version (from /proc/version):
> > $ cat /proc/version
> > Linux version 2.4.32 (root@tiger) (gcc version 3.3.5 (Debian 1:3.3.5-13)) #1 Wed Dec 21 10:57:37 CST 2005
> >
> > [5.] Output of Oops.. message (if applicable) with symbolic information
> > resolved (see Documentation/oops-tracing.txt)
> > See attached oops.w.log (triggered by w) and oops.ps.log (triggered by ps).
> >
> > [6.] A small shell script or example program which triggers the
> > problem (if possible)
> > Once it starts oopsing, "w" and "ps aux" trigger an oops every time.
> >
> > [7.] Environment
> > i386 Debian sarge on a network server in a closet.
> >
<snip hardware information from original email>
> >
> > [7.7.] Other information that might be relevant to the problem
> > (please look in /proc and include all information that you
> > think to be relevant):
> > ls -lR /proc does not trigger an oops.
> >
> > [X.] Other notes, patches, fixes, workarounds:
> > I am not familiar with this aspect of linux, but 2.4.33.3's proc_pid_stat()
> > appears to be identical. It also appears there have been several bug fixes
> > in 2.6 (including race condition corrections); perhaps there are issues
> > fixed in 2.6 but whose patches not been backported to 2.4?
>
> The code in 2.6 is quite different. Your problem here does not seem related
> to locking, because you can repeat it at will. I rather think that one of
> your tasks is going ill. I looked at the oops and compared with the code.
> In your case, the task->sig pointer equals 0x170, which is clearly wrong.
> I suspect it's a kernel thread which goes mad, because user tasks should
> not be able to write anything there.
>
> When this problem happens, could you try to identify the wrong task ?
> Basically, this should help :
>
> # cd /proc
> # for i in [0-9]*; do
> echo "Trying pid $i..."
> if ! cat $i/stat > /dev/null; then
> echo "!!! BAD PID : $i !!!"
> fi
> done
>
> If it is a kernel thread (low pid), you will not be able to find
> which one until you reboot, because the only other entry which
> might return its name is "status" which also uses collect_sigign_sigcatch().
> Otherwise, it is easy to find the full command line of the process :
>
> # echo $(tr '\000' ' ' < /proc/$BADPID/cmdline)
>
> > The machine in question become unstable, about a year ago, when an additional
> > harddrive and ram were added and the kernel upgraded from 2.2.19 (with ext3)
> > to 2.4.31. There, there may certainly be a hardware issue playing a role
> > here. However, since the problem is completely reliable once it starts within
> > a given boot and occurs, given time, across reboots, it seems likely
> > that a software bug may be involved.
>
> To be honnest, I'm skeptical. This is the first report of such an easily
> reproductible problem. Since you added RAM in your system, I would strongly
> suggest passing memtest on it during a full night. Random bit flips might
> turn a null into non-null, causing some unexpected code paths to be taken.
--
Chris Frost | <http://www.frostnet.net/chris/>
-------------+----------------------------------
Public PGP Key:
Email [email protected] with the subject "retrieve pgp key"
or visit <http://www.frostnet.net/chris/about/pgp_key.phtml>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[Index of Archives]
[Kernel Newbies]
[Netfilter]
[Bugtraq]
[Photo]
[Stuff]
[Gimp]
[Yosemite News]
[MIPS Linux]
[ARM Linux]
[Linux Security]
[Linux RAID]
[Video 4 Linux]
[Linux for the blind]
[Linux Resources]