Bash gurus,
I have a bash script that monitors a directory for files. Whenever it
finds files in this directory, it passes them to a support script for
processing. The support script moves the files to another directory
prior to processing them, and it is run in the background to prevent
blocking the main script. A simplified version of the main script loop
follows:
# Execute once every 10 seconds
while true
do
    # Fork a background script to process each file in the spool directory
    for fname in `ls /spool/dir/*.ext 2> /dev/null`
    do
        bname=`basename "$fname"`
        bg_script "$bname" &
    done
    sleep 10
done
This is pretty simple, and it worked flawlessly for over a year on a
dual-processor server running Fedora Core 3. However, after upgrading to
an 8-core (2 CPUs x 4 cores) server running Fedora Core 6, the script
hangs a few times a week. This is a bad thing, so I have to keep a close
eye on the server until the bug is resolved.
The process tree of the script when it's hanging follows:
[root@server ~]# ps axjf
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    1  3512  3510  2302 ?           -1 S        0   0:59 /bin/bash /usr/local/bin/script
 3512 21432  3510  2302 ?           -1 R        0  40:50  \_ /bin/bash /usr/local/bin/script
Note that the parent process (PID 3512) is sleeping and has accumulated
relatively little CPU time since boot. The child process (PID 21432) is
stuck in a hard loop, and top shows it consuming 100% of one core. It
never terminates, so it permanently blocks the parent. If the child is
killed, the parent resumes execution without any problems.
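For anyone who wants to dig into a hang like this, attaching to the
spinning child can show what it's doing (assuming strace and gdb are
available; 21432 is the child PID from the tree above):

# Show the system calls the child is making; a hard userspace loop
# typically produces no syscall output at all under strace.
strace -p 21432

# Dump a backtrace to see where inside bash the child is looping.
gdb -p 21432 -batch -ex bt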
The interesting thing is that the script never calls itself. It only
calls the support script as a background job. I'm not an expert on the
inner workings of bash, but I believe the child process is a temporary
artifact of the fork-exec sequence bash uses to run external commands:
a copy of the shell is created by fork(), and that copy is normally
replaced by the command via exec(). Here it seems the forked copy never
reaches the exec(), so it lingers as a clone of the parent script.
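For reference, the loop contains two constructs that each fork a child
shell: the backquoted command substitution and the trailing "&". A
minimal illustration (it relies on $BASHPID, which requires bash 4.0 or
later, so it won't run on the FC6 box itself):

echo "parent shell: $$"

# Fork #1: a command substitution runs in a forked copy of the shell,
# which then execs the external command (ls in the real script).
out=`echo "substitution ran in PID $BASHPID"`
echo "$out"

# Fork #2: a background job also starts life as a forked copy of the
# shell before exec'ing its command.
( echo "background job ran in PID $BASHPID" ) &
wait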
I went through the logs and I'm fairly confident that the script is
hanging at the top of the for loop, presumably after exhausting the list
created by the "ls" command. There is nothing interesting about the
"ls" command itself, as there are usually fewer than 20 files in the
directory it's listing.
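One workaround I'm considering is to drop the backquoted "ls" and let
the shell glob the filenames itself, which removes the command
substitution fork from the loop entirely. A sketch, not yet tested on
the production box:

while true
do
    for fname in /spool/dir/*.ext
    do
        # With no matches the glob stays literal, so skip that case
        [ -e "$fname" ] || continue
        bname=`basename "$fname"`
        bg_script "$bname" &
    done
    sleep 10
done

Note that the background job still forks, so this only eliminates one of
the two fork sites.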
I'd appreciate any replies from anyone who has experienced this
problem. I have some ideas for working around it (such as the glob-based
loop above), but I'd like to actually understand its cause and how to
properly resolve it so that I don't get stuck on something similar in
the future.
Thank you,
Matthew Roth
InterMedia Marketing Solutions
Software Engineer and Systems Developer