A lot of speculation follows; it is probably not very light reading and might be complete gobbledygook.
Tommy Reynolds wrote:
| Try adding more swap space. Check the web for how to use an ordinary
| file for this if you don't have any free disk space. Something like:
|
Adding swap might help, and I certainly hope it will. Under normal load the old swap, which was half a gigabyte in size, was in practice unused. The total amount of swap is now four and a half gigabytes, which should be a lot more than is required.
Somehow this combination of events and programs caused a very rapid consumption of both CPU and memory, which resulted in a state that was unrecoverable despite the OOM killer. As the OOM killer killed httpd processes, the load coming from them should have vanished and their memory should have been freed. This did not happen, even though, according to the Apache logs (if Apache was still able to update its logs), the external pressure had also vanished: the spammer had stopped loading pages once they became unresponsive.
The httpd process seems to peak in its CPU and memory usage at startup, so the OOM killer probably kept killing the "same" innocent httpd process over and over again. Meanwhile nothing got any CPU except the httpd parent process that spawned new children and the OOM killer that killed them.
Probably a better workaround would be to limit the resource usage of the apache user and the postgres user, as Alexander Dalloz proposes. I'll probably try this if the increased swap does not help.
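To make the idea of a resource limit concrete, here is a minimal sketch, assuming Linux and Python 3. It is not necessarily what Alexander had in mind (limits can also be set from the shell or the login configuration). The helper name run_with_memory_cap is made up for illustration; it starts a child process with its address space capped through the standard resource.setrlimit call.

```python
import resource
import subprocess


def run_with_memory_cap(argv, max_bytes, **run_kwargs):
    """Run argv as a child process whose address space is capped at max_bytes.

    Hypothetical helper for illustration only. The limit is applied in the
    child between fork() and exec(), so the calling process is not affected.
    """
    def apply_limit():
        # RLIMIT_AS caps the total virtual address space of the process.
        # Allocations beyond the cap fail instead of eating into swap.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    return subprocess.run(argv, preexec_fn=apply_limit, **run_kwargs)
```

A process started this way gets an allocation failure when it hits the cap, so a runaway child should die on its own before the whole machine runs out of memory. The limit is per process, not per user, so many children can still add up.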
Thanks to both of you.
I could also work around this problem by implementing a script that monitors the resource usage of both the postgres and apache users and shuts the services down for a while when a preset limit is exceeded or, better yet, use nagios to do this.
What I am actually looking for is clues on how to find out what causes the rapid consumption of resources: where it happens, by whom, and how fast. I'm looking for tools to do better post-mortem diagnosis, or tools that would gather better information for a post-mortem diagnosis. The OOM lines in /var/log/messages did not help me a bit.
I have a hunch that either PHP, httpd or PostgreSQL has a bug that causes it to consume everything it can get when certain conditions are met. All three programs have switches that might help, but to identify which switches, and in which program, I need more info.
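The core of such a monitoring script could look roughly like the sketch below, assuming Python 3 and a procps-style ps. The function names and the limits are invented for illustration, and what to do when a user is over the limit (stopping a service, alerting) is deliberately left out.

```python
import subprocess
from collections import defaultdict


def current_ps_output():
    """Return one 'user rss_in_KiB' line per process, without a header."""
    result = subprocess.run(
        ["ps", "-e", "-o", "user=", "-o", "rss="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def rss_by_user(ps_output):
    """Sum resident memory (KiB) per user from the ps output above."""
    totals = defaultdict(int)
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip blank or malformed lines rather than crash the monitor.
        if len(fields) != 2 or not fields[1].isdigit():
            continue
        totals[fields[0]] += int(fields[1])
    return dict(totals)


def users_over_limit(totals, limits_kib):
    """Return the users whose summed RSS exceeds their (hypothetical) limit."""
    return sorted(
        user for user, limit in limits_kib.items()
        if totals.get(user, 0) > limit
    )
```

Calling users_over_limit(rss_by_user(current_ps_output()), {"apache": 1500000, "postgres": 1000000}) from cron or a loop would give the list of users to act on. Summed RSS counts shared pages once per process, so it overestimates real usage, and the limits would need tuning against normal load.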
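One cheap way to see how fast the resources disappear would be a small sampler that writes a timestamped line every few seconds, so that there is a timeline to read after the next incident. The sketch below assumes Linux and Python 3 and reads /proc/meminfo and /proc/loadavg. The function names, the log format and the interval are made up for illustration, and existing tools such as sar or vmstat may well do this better.

```python
import os
import time


def parse_meminfo(text):
    """Turn the text of /proc/meminfo into a {field: value} dict (mostly KiB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def snapshot_line(meminfo, loadavg_text, now):
    """Format one log line: timestamp, 1-minute load, free memory, free swap."""
    load1 = loadavg_text.split()[0]
    return "%d load1=%s MemFree=%d SwapFree=%d" % (
        now, load1, meminfo.get("MemFree", 0), meminfo.get("SwapFree", 0))


def sample_forever(log_path, interval_seconds=5):
    """Append a snapshot to log_path every interval_seconds until killed."""
    while True:
        with open("/proc/meminfo") as f:
            meminfo = parse_meminfo(f.read())
        with open("/proc/loadavg") as f:
            loadavg_text = f.read()
        with open(log_path, "a") as log:
            log.write(snapshot_line(meminfo, loadavg_text, time.time()) + "\n")
            # Force the line to disk; the box may hang before a normal flush.
            log.flush()
            os.fsync(log.fileno())
        time.sleep(interval_seconds)
```

This only shows when and how fast memory and swap ran out, not who used them. Combining it with per-process or per-user figures would be needed to answer the "by whom" part.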
Regards,