Re: PROBLEM: 2.4 oops: proc_pid_stat()

Thank you for these thoughts, Willy. Because the affected machine is a
server in another state I first tried going ahead and upgrading to 2.6.17.3.
The machine has now been up for 91 days and counting; in the year since
the mentioned change that brought on the segfaults I do not believe it
had ever ran that long. So it seems 2.4.32 to 2.6.17.3 either fixed
or happened to sidestep the issue. If I had another machine with the
problem I would be happy to investigate further, but, unfortunately, this
machine needs to stay up.

I just wanted to let possible others with similar problems know and
thank you and Grant Coady for your feedback!


On Sun, Sep 17, 2006 at 07:50:32AM +0200, Willy Tarreau wrote:
> On Sat, Sep 16, 2006 at 04:24:02PM -0700, Chris Frost wrote:
> > [1.] One line summary of the problem:
> > 2.4.32 proc_pid_stat() repeatedly segfaults.
> > 
> > [2.] Full description of the problem/report:
> > 2.4.32 kernel, after being up for a few days to a few weeks, repeatedly
> > segfaults in proc_pid_stat(), triggered by w, ps, and other programs.
> > 
> > [3.] Keywords (i.e., modules, networking, kernel):
> > kernel proc_pid_stat()
> > 
> > [4.] Kernel version (from /proc/version):
> > $ cat /proc/version
> > Linux version 2.4.32 (root@tiger) (gcc version 3.3.5 (Debian 1:3.3.5-13)) #1 Wed Dec 21 10:57:37 CST 2005
> > 
> > [5.] Output of Oops.. message (if applicable) with symbolic information 
> >      resolved (see Documentation/oops-tracing.txt)
> > See attached oops.w.log (triggered by w) and oops.ps.log (triggered by ps).
> > 
> > [6.] A small shell script or example program which triggers the
> >      problem (if possible)
> > Once it starts oopsing, "w" and "ps aux" trigger an oops every time.
> > 
> > [7.] Environment
> > i386 Debian sarge on a network server in a closet.
> > 

<snip hardware information from original email>

> > 
> > [7.7.] Other information that might be relevant to the problem
> >        (please look in /proc and include all information that you
> >        think to be relevant):
> > ls -lR /proc does not trigger an oops.
> > 
> > [X.] Other notes, patches, fixes, workarounds:
> > I am not familiar with this aspect of linux, but 2.4.33.3's proc_pid_stat()
> > appears to be identical. It also appears there have been several bug fixes
> > in 2.6 (including race condition corrections); perhaps there are issues
> > fixed in 2.6 but whose patches not been backported to 2.4?
> 
> The code in 2.6 is quite different. Your problem here does not seem related
> to locking, because you can repeat it at will. I rather think that one of
> your tasks is going ill. I looked at the oops and compared with the code.
> In your case, the task->sig pointer equals 0x170, which is clearly wrong.
> I suspect it's a kernel thread which goes mad, because user tasks should
> not be able to write anything there.
> 
> When this problem happens, could you try to identify the wrong task ?
> Basically, this should help :
> 
> # cd /proc
> # for i in [0-9]*; do
>     echo "Trying pid $i..."
>     if ! cat $i/stat > /dev/null; then
>       echo "!!! BAD PID : $i !!!"
>     fi
>  done
> 
> If it is a kernel thread (low pid), you will not be able to find
> which one until you reboot, because the only other entry which
> might return its name is "status" which also uses collect_sigign_sigcatch().
> Otherwise, it is easy to find the full command line of the process :
> 
> # echo $(tr '\000' ' ' < /proc/$BADPID/cmdline)
> 
> > The machine in question become unstable, about a year ago, when an additional
> > harddrive and ram were added and the kernel upgraded from 2.2.19 (with ext3)
> > to 2.4.31. There, there may certainly be a hardware issue playing a role
> > here. However, since the problem is completely reliable once it starts within
> > a given boot and occurs, given time, across reboots, it seems likely
> > that a software bug may be involved.
> 
> To be honnest, I'm skeptical. This is the first report of such an easily
> reproductible problem. Since you added RAM in your system, I would strongly
> suggest passing memtest on it during a full night. Random bit flips might
> turn a null into non-null, causing some unexpected code paths to be taken.

-- 
Chris Frost  |  <http://www.frostnet.net/chris/>
-------------+----------------------------------
Public PGP Key:
   Email [email protected] with the subject "retrieve pgp key"
   or visit <http://www.frostnet.net/chris/about/pgp_key.phtml>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Prev by Date: Re: 2.6.19 file content corruption on ext3
Next by Date: Re: xfslogd-spinlock bug?
Previous by thread: [PATCH, RFC] reimplement flush_workqueue()
Next by thread: drivers/media/video/usbvision/usbvision-video.c: negative array
Index(es):
- Date
- Thread

[Index of Archives] [Kernel Newbies] [Netfilter] [Bugtraq] [Photo] [Stuff] [Gimp] [Yosemite News] [MIPS Linux] [ARM Linux] [Linux Security] [Linux RAID] [Video 4 Linux] [Linux for the blind] [Linux Resources]