We have a threading library which has been in production for six years and currently functions on Solaris 2.6-2.9 Sparc, Solaris 2.7-2.10 x86, HP-UX 11.00, Tru64 5.1(a,b), AIX 4.3.x and AIX 5.x.
The library starts 5-8 threads within the current process and the operation runs to completion (with or without error). Depending on what happened during processing, the threads either complete on their own or are canceled and then complete.
At some later time this is repeated, N times in all, without the main process exiting. The threads are NOT detached.
The problem occurs on Fedora Core 3 when pthread_cancel is called with the thread id of a thread which has already completed.
If a thread has exited and we call pthread_cancel with that thread id on Fedora Core 3
( version info
getconf GNU_LIBPTHREAD_VERSION
NPTL 2.3.4
>uname -a
Linux irl-73-26 2.6.10-1.770_FC3 #1 Thu Feb 24 14:00:06 EST 2005 i686 i686 i386 GNU/Linux
)
the application segfaults. Is this the expected behavior?
I am also getting a segfault when pthread_cond_timedwait is called. I am still determining the
exact state when the segfault occurred. The back trace shows
#0 0x005c57a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x00839dbc in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
The directory listing shows:
ls -l /lib/tls/
total 1936
drwxr-xr-x 2 root root 4096 Mar 23 04:03 i486
drwxr-xr-x 2 root root 4096 Mar 23 04:03 i586
drwxr-xr-x 2 root root 4096 Mar 23 04:03 i686
-rwxr-xr-x 1 root root 1524828 Dec 21 02:04 libc-2.3.4.so
lrwxrwxrwx 1 root root 13 Mar 22 18:42 libc.so.6 -> libc-2.3.4.so
-rwxr-xr-x 1 root root 215272 Dec 21 02:04 libm-2.3.4.so
lrwxrwxrwx 1 root root 13 Mar 22 18:42 libm.so.6 -> libm-2.3.4.so
-rwxr-xr-x 1 root root 108560 Dec 21 02:04 libpthread-2.3.4.so
lrwxrwxrwx 1 root root 19 Mar 22 18:42 libpthread.so.0 -> libpthread-2.3.4.so
-rwxr-xr-x 1 root root 50984 Dec 21 02:04 librt-2.3.4.so
lrwxrwxrwx 1 root root 14 Mar 22 18:42 librt.so.1 -> librt-2.3.4.so
-rwxr-xr-x 1 root root 32308 Dec 21 02:04 libthread_db-1.0.so
lrwxrwxrwx 1 root root 19 Mar 22 18:42 libthread_db.so.1 -> libthread_db-1.0.so
Is this what NPTL on Fedora Core 3 does TODAY, or is there a problem in the sequence in which we release mutexes or condition variables that would cause this behavior in our code on Fedora Core 3?
We maintain internal thread exit status, so I can skip cancelling the threads which have successfully exited. We normally just cancel everything we started, as a big hammer to make sure every thread shuts down and exits. We can make the abort function a bit smarter, since it has access to our internal thread status, if need be.
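A minimal sketch of what such a smarter abort function might look like, assuming each thread records its own completion under a lock just before it returns. The names used here (struct worker, status_lock, abort_workers, run_demo, NWORKERS) are made up for illustration and are not taken from our library; the "quick" flag exists only so the demo has some threads that finish on their own and some that have to be cancelled.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <string.h>
#include <unistd.h>

#define NWORKERS 4              /* hypothetical pool size for the demo */

/* Hypothetical per-thread record; field names are illustrative only. */
struct worker {
    pthread_t tid;
    int started;                /* pthread_create succeeded */
    int finished;               /* set by the thread just before it returns */
    int quick;                  /* demo only: return at once or block forever */
};

static pthread_mutex_t status_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker_main(void *arg)
{
    struct worker *w = arg;

    if (!w->quick) {
        for (;;)
            sleep(1);           /* sleep() is a cancellation point */
    }
    pthread_mutex_lock(&status_lock);
    w->finished = 1;            /* record normal completion */
    pthread_mutex_unlock(&status_lock);
    return NULL;
}

/* Cancel only the threads that have not reported completion, then join
 * every started thread. Returns the number of cancel requests sent. */
static int abort_workers(struct worker *w, int n)
{
    int i, sent = 0;

    assert(w != NULL);
    for (i = 0; i < n; i++) {
        int done;

        if (!w[i].started)
            continue;
        pthread_mutex_lock(&status_lock);
        done = w[i].finished;
        pthread_mutex_unlock(&status_lock);
        /* The thread has not been joined yet, so its id is still valid here. */
        if (!done && pthread_cancel(w[i].tid) == 0)
            sent++;
    }
    for (i = 0; i < n; i++) {
        if (w[i].started) {
            pthread_join(w[i].tid, NULL);
            w[i].started = 0;   /* the id must not be used after the join */
        }
    }
    return sent;
}

/* Demo: even-numbered workers finish on their own, odd ones block. */
static int run_demo(void)
{
    struct worker w[NWORKERS];
    int i, expected = 0, done;

    memset(w, 0, sizeof w);
    for (i = 0; i < NWORKERS; i++) {
        w[i].quick = (i % 2 == 0);
        w[i].started =
            (pthread_create(&w[i].tid, NULL, worker_main, &w[i]) == 0);
        if (w[i].started && w[i].quick)
            expected++;
    }
    do {                        /* wait until the quick workers have reported */
        sched_yield();
        done = 0;
        pthread_mutex_lock(&status_lock);
        for (i = 0; i < NWORKERS; i++)
            done += w[i].finished;
        pthread_mutex_unlock(&status_lock);
    } while (done < expected);
    return abort_workers(w, NWORKERS);
}
```

Calling run_demo() from main should send a cancel request only to the workers that are still blocked. Clearing "started" after the join is there so a later abort pass never hands a stale thread id to pthread_cancel.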
On the OSs I mentioned above, pthread_cancel returns 0 on success. On failure:
On HP-UX 11.00 pthread_cancel returns the value ESRCH; errno is NOT set.
On Solaris SPARC and x86, same as HP-UX 11.00.
On AIX, same as HP-UX and Solaris.
On Tru64 pthread_cancel returns EINVAL or ESRCH; errno is not set.
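For reference, a small sketch of how the return value can be checked, since the error number comes back from pthread_cancel itself rather than through errno. The helper names (cancel_rc_name, blocker, cancel_live_thread) are hypothetical, and the sketch deliberately cancels only a thread that is still alive, so it does not reproduce the segfault case.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>

/* Map a pthread_cancel return value to a short name for logging. */
static const char *cancel_rc_name(int rc)
{
    switch (rc) {
    case 0:      return "ok";
    case ESRCH:  return "ESRCH";
    case EINVAL: return "EINVAL";
    default:     return strerror(rc);
    }
}

/* Thread body that blocks until it is cancelled. */
static void *blocker(void *arg)
{
    (void)arg;
    for (;;)
        sleep(1);               /* sleep() is a cancellation point */
}

/* Start a thread, cancel it while it is alive, then join it.
 * Returns the pthread_cancel return value, or -1 if the thread could
 * not be created. *was_cancelled is set to 1 if the join reported
 * PTHREAD_CANCELED. */
static int cancel_live_thread(int *was_cancelled)
{
    pthread_t tid;
    void *res = NULL;
    int rc;

    assert(was_cancelled != NULL);
    if (pthread_create(&tid, NULL, blocker, NULL) != 0)
        return -1;
    rc = pthread_cancel(tid);   /* the error number is the return value */
    pthread_join(tid, &res);
    *was_cancelled = (res == PTHREAD_CANCELED);
    return rc;
}
```

A caller would log cancel_rc_name(rc) when rc is non-zero instead of consulting errno.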
Eric Bruno.