After upgrading from Fedora Core 6 (2.6.22) to Fedora 12 (2.6.32) I
started seeing page allocation failures when running a UDP multicast
packet writer under load. Under particularly high load, the machine
sometimes crashes. This was fixed on one machine by adding memory (total
memory increased from 1GB to 3GB). However, I notice some strange page
alloc/free behavior in the sar logs.

Here are sar log excerpts from a Xeon E5160 with 1GB of memory, running
the packet writer on FC6 (2.6.22). Multicast traffic kicks in at 9:30:

$ sar -s 09:00:00 -e 11:00:00 -r -f /var/log/sa/sa17
Linux 2.6.22.7-57.fc6 (ti112)   06/17/2010

09:00:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached kbswpfree kbswpused  %swpused  kbswpcad
09:10:01 AM     37652    990364     96.34     11272    511860   1999292       636      0.03        28
09:20:01 AM     35724    992292     96.52     11320    513532   1999292       636      0.03        28
09:30:01 AM     39012    989004     96.21      8140    505332   1999292       636      0.03        28
09:40:02 AM     35916    992100     96.51       484    515620   1999292       636      0.03         0
09:50:03 AM     38324    989692     96.27       448    515348   1999292       636      0.03         0
10:00:03 AM    139940    888076     86.39       252    371424   1999292       636      0.03         0
10:10:04 AM     35720    992296     96.53       436    518692   1999292       636      0.03         0
10:20:02 AM     37140    990876     96.39       428    515760   1999292       636      0.03         0
10:30:01 AM     34360    993656     96.66       480    520180   1999292       636      0.03         0
10:40:02 AM     40108    987908     96.10       424    504508   1999292       636      0.03         0
10:50:02 AM     35800    992216     96.52       440    517664   1999292       636      0.03         0
Average:        46336    981680     95.49      3102    500902   1999292       636      0.03         8

$ sar -s 09:00:00 -e 11:00:00 -R -f /var/log/sa/sa17
Linux 2.6.22.7-57.fc6 (ti112)   06/17/2010

09:00:01 AM   frmpg/s   bufpg/s   campg/s
09:10:01 AM      1.08      0.19     -0.77
09:20:01 AM     -0.82      0.02      0.71
09:30:01 AM      1.40     -1.36     -3.50
09:40:02 AM     -1.33     -3.28      4.40
09:50:03 AM      1.03     -0.02     -0.12
10:00:03 AM     43.55     -0.08    -61.68
10:10:04 AM    -44.48      0.08     62.85
10:20:02 AM      0.61     -0.00     -1.26
10:30:01 AM     -1.19      0.02      1.89
10:40:02 AM      2.45     -0.02     -6.69
10:50:02 AM     -1.84      0.01      5.63
Average:         0.03     -0.40      0.16

$ sar -s 09:00:00 -e 11:00:00 -n DEV -f /var/log/sa/sa17 | egrep 'lan1|IFACE'
09:00:01 AM     IFACE   rxpck/s   txpck/s      rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:10:01 AM      lan1   1145.19      0.00    489752.98      0.00      0.00      0.00   1145.19
09:20:01 AM      lan1   1564.38      0.00    725456.33      0.00      0.00      0.00   1564.38
09:30:01 AM      lan1   2104.73      0.00    910104.53      0.00      0.00      0.00   2104.73
09:40:02 AM      lan1  12796.92      0.00   8852604.63      0.00      0.00      0.00  12796.92
09:50:03 AM      lan1  14791.82      0.00  10975987.73      0.00      0.00      0.00  14791.82
10:00:03 AM      lan1  13156.31      0.00   9424034.76      0.00      0.00      0.00  13156.31
10:10:04 AM      lan1  17451.47      0.00  13812805.65      0.00      0.00      0.00  17451.47
10:20:02 AM      lan1  15836.22      0.00  12108763.28      0.00      0.00      0.00  15836.22
10:30:01 AM      lan1  13652.76      0.00   9891058.06      0.00      0.00      0.00  13652.76
10:40:02 AM      lan1  14155.56      0.00  10490290.79      0.00      0.00      0.00  14155.56
10:50:02 AM      lan1  14010.38      0.00  10386290.49      0.00      0.00      0.00  14010.38
Average:         lan1  10962.41      0.00   8000729.59      0.00      0.00      0.00  10962.41

$ ethtool -i lan1
driver: e1000
version: 7.3.20-k2-NAPI
firmware-version: 1.6-12
bus-info: 0000:04:00.1

Here are the logs for a Xeon E5150 with 1GB, running the same app on F12
(2.6.32). This host has crashed occasionally under load since the upgrade
to F12 (it crashed around 10:00 AM today, as shown below). It also
regularly generates page allocation failures.

$ sar -s 09:00:00 -e 11:00:00 -r -f /var/log/sa/sa17
Linux 2.6.32.10-90.fc12.x86_64 (ti110)   06/17/2010   _x86_64_   (2 CPU)

09:00:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
09:10:01 AM     39300    983424     96.16     51824    406536    481980     15.45
09:20:02 AM     38612    984112     96.22     46464    413604    481980     15.45
09:30:01 AM    165364    857360     83.83      7220    286188    491696     15.76
09:40:02 AM    153480    869244     84.99      1896    347252    492528     15.79
09:50:03 AM     62312    960412     93.91      2120    442892    492224     15.78
10:00:02 AM    193040    829684     81.12      1780    270852    492228     15.78
Average:       108685    914039     89.37     18551    361221    488773     15.67

10:02:16 AM       LINUX RESTART
...

$ sar -s 09:00:00 -e 11:00:00 -R -f /var/log/sa/sa17
Linux 2.6.32.10-90.fc12.x86_64 (ti110)   06/17/2010   _x86_64_   (2 CPU)

09:00:01 AM   frmpg/s   bufpg/s   campg/s
09:10:01 AM      1.72     -8.37      7.75
09:20:02 AM     -0.29     -2.23      2.94
09:30:01 AM     52.77    -16.34    -53.04
09:40:02 AM     -4.91     -2.20     25.24
09:50:03 AM    -37.69      0.09     39.54
10:00:02 AM     54.19     -0.14    -71.32
Average:        10.92     -4.85     -8.09

10:02:16 AM       LINUX RESTART
...

$ sar -s 09:00:00 -e 11:00:00 -n DEV -f /var/log/sa/sa17 | egrep 'lan1|IFACE'
09:00:01 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
09:10:01 AM      lan1    797.12      0.00    319.19      0.00      0.00      0.00    802.30
09:20:02 AM      lan1   1084.88      0.00    472.74      0.00      0.00      0.00   1085.10
09:30:01 AM      lan1   1505.95      0.00    618.34      0.00      0.00      0.00   1498.23
09:40:02 AM      lan1   9207.79      0.00   5931.84      0.00      0.00      0.00   9195.59
09:50:03 AM      lan1  10506.49      0.00   7270.98      0.00      0.00      0.00  10502.26
10:00:02 AM      lan1   9366.62      0.00   6258.04      0.00      0.00      0.00   9324.16
Average:         lan1   5422.93      0.00   3486.58      0.00      0.00      0.00   5412.71
...

$ ethtool -i lan1
driver: e1000e
version: 1.0.2-k2
firmware-version: 1.6-12
bus-info: 0000:04:00.1

Some observations: free memory (kbmemfree) changes drastically under F12
whereas it remains more or less constant around 40MB on FC6 (except for a
spike at 10AM). Pages are freed from the cache (kbcached), possibly for
socket buffers, and then allocated back to the cache, repeatedly. The
page cache remains constant around 500MB on FC6 whereas the peak cache
size remains below 500MB on F12.

As mentioned above, after installing 2GB additional memory, I've avoided
the allocation failures and crashing on a system running F12 on hardware
identical to the second system above. However, I'm seeing similar
kbmemfree/kbcached behavior, albeit with a larger page cache (~2GB).
Also, average kbmemfree is higher (around 100MB).

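As a rough cross-check of the two F12 tables above, the frmpg/s column
can be compared with the change in kbmemfree between samples. The Python
sketch below does that for the ti110 figures. It assumes 4 KB pages (the
usual x86 size, not something the sar output states), and the names
SAMPLES and estimated_frmpg are made up for this illustration.

```python
from datetime import datetime

PAGE_KB = 4  # assumption: 4 KB pages; the sar output does not state the page size

# (timestamp, kbmemfree, reported frmpg/s), copied from the F12 (ti110) logs
SAMPLES = [
    ("09:10:01", 39300, 1.72),
    ("09:20:02", 38612, -0.29),
    ("09:30:01", 165364, 52.77),
    ("09:40:02", 153480, -4.91),
    ("09:50:03", 62312, -37.69),
    ("10:00:02", 193040, 54.19),
]


def estimated_frmpg(samples, page_kb=PAGE_KB):
    """Return (timestamp, estimated frmpg/s, reported frmpg/s) per interval."""
    fmt = "%H:%M:%S"
    rows = []
    for (t0, free0, _), (t1, free1, reported) in zip(samples, samples[1:]):
        # Interval length taken from the sar timestamps (about 600 s each)
        seconds = (datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)).total_seconds()
        # Change in free memory, converted from KB to pages
        pages = (free1 - free0) / page_kb
        rows.append((t1, pages / seconds, reported))
    return rows


if __name__ == "__main__":
    for ts, est, rep in estimated_frmpg(SAMPLES):
        print(f"{ts}  estimated {est:8.2f}  reported {rep:8.2f}")
```

Under those assumptions the estimates track the reported values closely,
so a swing of about 50 pages/s over one 10-minute sample corresponds to
roughly 125MB moving between the page cache and the free list.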
Lastly, on a Xeon E5530 (a newer processor than the two above) with 3GB
memory, running the same workload on F12, I'm seeing memory behavior
similar to the first system above: kbmemfree remains around 50MB and
kbcached remains around 2GB.

I've tried tuning the following sysctl parameters:

- vm.dirty_background_ratio
- vm.dirty_ratio
- vm.vfs_cache_pressure
- vm.swappiness
- vm.min_free_kbytes

and disabling /sys/block/*/queue/iosched/low_latency. The only settings
that seem to have an effect (i.e. prevent the machine from crashing) are
vm.swappiness=40 and vm.min_free_kbytes=49152, but the page allocation
failures are still happening and packets are still being dropped. I
believe the packet loss is due to the NIC FIFO filling up (indicated by
the rx_missed_errors ethtool counter) because no socket buffers are
available to copy into (alloc_rx_buff_failed counter).

Why does this workload require more memory under F12, but only on the
older processor? Do the page allocation failures have something to do
with page cache reclaim? Is there anything else I can tune to improve
stability and reduce packet loss?

Any insight is appreciated.

- Kelvin