Erratic memory allocation, page allocation failures, and crashing on 2.6.32 with e1000e

After upgrading from Fedora Core 6 (2.6.22) to Fedora 12 (2.6.32) I
started seeing page allocation failures when running a UDP multicast packet
writer under load. Under particularly high load, the machine sometimes crashes.
Adding memory (from 1GB to 3GB total) fixed this on one machine, but I still
see some strange page alloc/free behavior in the sar logs.

Here are sar log excerpts from a Xeon E5160 with 1GB of memory, running the
packet writer on FC6 (2.6.22). Multicast traffic kicks in at 9:30:

$ sar -s 09:00:00 -e 11:00:00 -r -f /var/log/sa/sa17
Linux 2.6.22.7-57.fc6 (ti112)   06/17/2010

09:00:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached kbswpfree kbswpused  %swpused  kbswpcad
09:10:01 AM     37652    990364     96.34     11272    511860   1999292       636      0.03        28
09:20:01 AM     35724    992292     96.52     11320    513532   1999292       636      0.03        28
09:30:01 AM     39012    989004     96.21      8140    505332   1999292       636      0.03        28
09:40:02 AM     35916    992100     96.51       484    515620   1999292       636      0.03         0
09:50:03 AM     38324    989692     96.27       448    515348   1999292       636      0.03         0
10:00:03 AM    139940    888076     86.39       252    371424   1999292       636      0.03         0
10:10:04 AM     35720    992296     96.53       436    518692   1999292       636      0.03         0
10:20:02 AM     37140    990876     96.39       428    515760   1999292       636      0.03         0
10:30:01 AM     34360    993656     96.66       480    520180   1999292       636      0.03         0
10:40:02 AM     40108    987908     96.10       424    504508   1999292       636      0.03         0
10:50:02 AM     35800    992216     96.52       440    517664   1999292       636      0.03         0
Average:        46336    981680     95.49      3102    500902   1999292       636      0.03         8
$ sar -s 09:00:00 -e 11:00:00 -R -f /var/log/sa/sa17
Linux 2.6.22.7-57.fc6 (ti112)   06/17/2010

09:00:01 AM   frmpg/s   bufpg/s   campg/s
09:10:01 AM      1.08      0.19     -0.77
09:20:01 AM     -0.82      0.02      0.71
09:30:01 AM      1.40     -1.36     -3.50
09:40:02 AM     -1.33     -3.28      4.40
09:50:03 AM      1.03     -0.02     -0.12
10:00:03 AM     43.55     -0.08    -61.68
10:10:04 AM    -44.48      0.08     62.85
10:20:02 AM      0.61     -0.00     -1.26
10:30:01 AM     -1.19      0.02      1.89
10:40:02 AM      2.45     -0.02     -6.69
10:50:02 AM     -1.84      0.01      5.63
Average:         0.03     -0.40      0.16
$ sar -s 09:00:00 -e 11:00:00 -n DEV -f /var/log/sa/sa17 | egrep 'lan1|IFACE'
09:00:01 AM     IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:10:01 AM      lan1   1145.19      0.00 489752.98      0.00      0.00      0.00   1145.19
09:20:01 AM      lan1   1564.38      0.00 725456.33      0.00      0.00      0.00   1564.38
09:30:01 AM      lan1   2104.73      0.00 910104.53      0.00      0.00      0.00   2104.73
09:40:02 AM      lan1  12796.92      0.00 8852604.63      0.00      0.00      0.00  12796.92
09:50:03 AM      lan1  14791.82      0.00 10975987.73      0.00      0.00      0.00  14791.82
10:00:03 AM      lan1  13156.31      0.00 9424034.76      0.00      0.00      0.00  13156.31
10:10:04 AM      lan1  17451.47      0.00 13812805.65      0.00      0.00      0.00  17451.47
10:20:02 AM      lan1  15836.22      0.00 12108763.28      0.00      0.00      0.00  15836.22
10:30:01 AM      lan1  13652.76      0.00 9891058.06      0.00      0.00      0.00  13652.76
10:40:02 AM      lan1  14155.56      0.00 10490290.79      0.00      0.00      0.00  14155.56
10:50:02 AM      lan1  14010.38      0.00 10386290.49      0.00      0.00      0.00  14010.38
Average:         lan1  10962.41      0.00 8000729.59      0.00      0.00      0.00  10962.41
$ ethtool -i lan1
driver: e1000
version: 7.3.20-k2-NAPI
firmware-version: 1.6-12
bus-info: 0000:04:00.1

Here are the logs for a Xeon E5150 with 1GB, running the same app on F12
(2.6.32). This host has crashed occasionally under load since upgrading to F12
(it crashed around 10:00 AM today, as shown below). It also regularly generates
page allocation failures.

$ sar -s 09:00:00 -e 11:00:00 -r -f /var/log/sa/sa17
Linux 2.6.32.10-90.fc12.x86_64 (ti110)  06/17/2010      _x86_64_        (2 CPU)

09:00:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
09:10:01 AM     39300    983424     96.16     51824    406536    481980     15.45
09:20:02 AM     38612    984112     96.22     46464    413604    481980     15.45
09:30:01 AM    165364    857360     83.83      7220    286188    491696     15.76
09:40:02 AM    153480    869244     84.99      1896    347252    492528     15.79
09:50:03 AM     62312    960412     93.91      2120    442892    492224     15.78
10:00:02 AM    193040    829684     81.12      1780    270852    492228     15.78
Average:       108685    914039     89.37     18551    361221    488773     15.67

10:02:16 AM       LINUX RESTART
...
$ sar -s 09:00:00 -e 11:00:00 -R -f /var/log/sa/sa17
Linux 2.6.32.10-90.fc12.x86_64 (ti110)  06/17/2010      _x86_64_        (2 CPU)

09:00:01 AM   frmpg/s   bufpg/s   campg/s
09:10:01 AM      1.72     -8.37      7.75
09:20:02 AM     -0.29     -2.23      2.94
09:30:01 AM     52.77    -16.34    -53.04
09:40:02 AM     -4.91     -2.20     25.24
09:50:03 AM    -37.69      0.09     39.54
10:00:02 AM     54.19     -0.14    -71.32
Average:        10.92     -4.85     -8.09

10:02:16 AM       LINUX RESTART
...
$ sar -s 09:00:00 -e 11:00:00 -n DEV -f /var/log/sa/sa17 | egrep 'lan1|IFACE'
09:00:01 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
09:10:01 AM      lan1    797.12      0.00    319.19      0.00      0.00      0.00    802.30
09:20:02 AM      lan1   1084.88      0.00    472.74      0.00      0.00      0.00   1085.10
09:30:01 AM      lan1   1505.95      0.00    618.34      0.00      0.00      0.00   1498.23
09:40:02 AM      lan1   9207.79      0.00   5931.84      0.00      0.00      0.00   9195.59
09:50:03 AM      lan1  10506.49      0.00   7270.98      0.00      0.00      0.00  10502.26
10:00:02 AM      lan1   9366.62      0.00   6258.04      0.00      0.00      0.00   9324.16
Average:         lan1   5422.93      0.00   3486.58      0.00      0.00      0.00   5412.71
...
$ ethtool -i lan1
driver: e1000e
version: 1.0.2-k2
firmware-version: 1.6-12
bus-info: 0000:04:00.1

Some observations: free memory (kbmemfree) changes drastically under F12,
swinging between about 38MB and 193MB, whereas it remains more or less constant
around 40MB on FC6 (except for a spike at 10AM). Pages are freed from the cache
(kbcached), possibly for socket buffers, and then allocated back to the cache,
repeatedly. The page cache remains constant around 500MB on FC6 (apart from the
10AM dip), whereas on F12 it varies between roughly 270MB and 440MB and never
reaches 500MB.
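
The kbmemfree swing described above can be put into numbers with a small
filter over the sar -r output. This is only a sketch: the function name
memfree_range is made up for illustration, and it assumes the 12-hour column
layout shown in the excerpts above (time, AM/PM, then kbmemfree).

```shell
#!/bin/sh
# Print the minimum and maximum kbmemfree seen in `sar -r` output read from stdin.
# Assumes the 12-hour layout shown in the post: time, AM/PM, then kbmemfree.
memfree_range() {
    awk '
        # Data rows only: skips the banner, the column header, and "Average:".
        $2 ~ /^(AM|PM)$/ && $3 ~ /^[0-9]+$/ {
            v = $3 + 0
            if (n == 0 || v < min) min = v
            if (n == 0 || v > max) max = v
            n++
        }
        END { if (n > 0) print min, max }
    '
}

# Example on a live system:
#   sar -s 09:00:00 -e 11:00:00 -r -f /var/log/sa/sa17 | memfree_range
```

Fed the FC6 excerpt above, this prints "34360 139940"; fed the F12 excerpt, it
prints "38612 193040".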

As mentioned above, after installing 2GB additional memory, I've avoided the
allocation failures and crashing on a system running F12 on hardware identical
to the second system above. However, I'm seeing similar kbmemfree/kbcached
behavior, albeit with a larger page cache (~2GB). Also, average kbmemfree is
higher (around 100MB).

Lastly, on a Xeon E5530 (a newer processor than the two above) with 3GB memory,
running the same workload on F12, I'm seeing memory behavior similar to the
first system above: kbmemfree remains around 50MB and kbcached remains around
2GB.

I've tried tuning the following sysctl parameters:

   - vm.dirty_background_ratio
   - vm.dirty_ratio
   - vm.vfs_cache_pressure
   - vm.swappiness
   - vm.min_free_kbytes

and disabling /sys/block/*/queue/iosched/low_latency. The only settings that
seem to have an effect (i.e. prevent the machine from crashing) are
vm.swappiness=40 and vm.min_free_kbytes=49152, but the page allocation failures
still happen and packets are still dropped. I believe the packet loss is due to
the NIC FIFO filling up (indicated by the rx_missed_errors ethtool counter)
because no socket buffers can be allocated to copy the packets into
(alloc_rx_buff_failed counter).
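
For anyone wanting to reproduce this, the two settings and the two counters
mentioned above can be applied and watched roughly as follows. This is a
minimal sketch: the function names apply_vm_settings and drop_counters are
made up for illustration, the interface name lan1 is the one from the logs
above, and the sysctl changes need root and do not persist across reboots.

```shell
#!/bin/sh
# Apply the two VM settings from the post at runtime (needs root; not persistent).
apply_vm_settings() {
    sysctl -w vm.swappiness=40
    sysctl -w vm.min_free_kbytes=49152
}

# Filter `ethtool -S` style output on stdin down to the two drop-related counters.
drop_counters() {
    grep -E '^[[:space:]]*(rx_missed_errors|alloc_rx_buff_failed):'
}

# Example on a live system:
#   apply_vm_settings
#   ethtool -S lan1 | drop_counters
```

Running the counter filter twice a few minutes apart shows whether both
counters are still climbing together under load.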

Why does this workload require more memory under F12, but only on the older
processor? Do the page allocation failures have something to do with the page
cache reclaiming? Is there anything else I can tune to improve stability and
reduce packet loss?

Any insight is appreciated.

- Kelvin